Overview

Dataset statistics

                                Original Data   Synthetic Data
Number of variables             15              15
Number of observations          14000           14000
Missing cells                   0               6
Missing cells (%)               0.0%            < 0.1%
Duplicate rows                  7               4
Duplicate rows (%)              0.1%            < 0.1%
Total size in memory            1.6 MiB         1.6 MiB
Average record size in memory   120.0 B         120.0 B
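
The memory figures above are consistent with simple arithmetic. The following is a minimal sketch, assuming every cell occupies 8 bytes (an 8-byte number or an 8-byte object pointer); that per-cell size is an assumption, not something the report states.

```python
# Figures copied from the dataset statistics table.
N_ROWS = 14_000
N_COLUMNS = 15
BYTES_PER_CELL = 8  # assumption: 8-byte values or 8-byte object pointers


def total_bytes(n_rows: int, n_columns: int, bytes_per_cell: int = BYTES_PER_CELL) -> int:
    """Estimate the in-memory size of a table with fixed-width cells."""
    return n_rows * n_columns * bytes_per_cell


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


total = total_bytes(N_ROWS, N_COLUMNS)
print(f"Total size: {to_mib(total):.1f} MiB")
print(f"Average record size: {total / N_ROWS:.1f} B")
```

Under that assumption the estimate reproduces both the 1.6 MiB total and the 120.0 B average record size.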

Variable types

              Original Data   Synthetic Data
Numeric       6               5
Categorical   9               10

Alerts

Duplicates
  Original Data:  Dataset has 7 (0.1%) duplicate rows
  Synthetic Data: Dataset has 4 (< 0.1%) duplicate rows
High Correlation
  Original Data:  education_num is highly overall correlated with education
  Synthetic Data: education_num is highly overall correlated with education
High Correlation
  Original Data:  education is highly overall correlated with education_num
  Synthetic Data: education is highly overall correlated with education_num
High Correlation
  Original Data:  relationship is highly overall correlated with gender
  Synthetic Data: relationship is highly overall correlated with gender
High Correlation
  Original Data:  gender is highly overall correlated with relationship
  Synthetic Data: gender is highly overall correlated with relationship
Imbalance
  Original Data:  race is highly imbalanced (65.3%)
  Synthetic Data: race is highly imbalanced (75.2%)
Imbalance
  Original Data:  native_country is highly imbalanced (82.5%)
  Synthetic Data: native_country is highly imbalanced (84.9%)
Zeros
  Original Data:  capital_gain has 12811 (91.5%) zeros
  Synthetic Data: Alert not present
Zeros
  Original Data:  capital_loss has 13354 (95.4%) zeros
  Synthetic Data: capital_loss has 13659 (97.6%) zeros
High Cardinality
  Original Data:  Alert not present
  Synthetic Data: capital_gain has a high cardinality: 108 distinct values
Imbalance
  Original Data:  Alert not present
  Synthetic Data: capital_gain is highly imbalanced (90.5%)
Skewed
  Original Data:  Alert not present
  Synthetic Data: fnlwgt is highly skewed (γ1 = 33.69087698)
Skewed
  Original Data:  Alert not present
  Synthetic Data: capital_loss is highly skewed (γ1 = 30.03470007)
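
The percentages in the Zeros alerts can be recomputed from the counts and the 14000 observations. This is a minimal sketch; the helper name `share` is hypothetical and only illustrates the arithmetic, not how the profiling tool derives its alerts.

```python
N_OBSERVATIONS = 14_000  # number of observations in both datasets


def share(count: int, total: int = N_OBSERVATIONS) -> str:
    """Format count/total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


# Zero counts copied from the alerts.
zeros = {
    "capital_gain (original)": 12_811,
    "capital_loss (original)": 13_354,
    "capital_loss (synthetic)": 13_659,
}

for name, count in zeros.items():
    print(f"{name}: {count} zeros = {share(count)}")
```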

Reproduction

                         Original Data                Synthetic Data
Analysis started         2023-01-21 11:11:05.341314   2023-01-21 11:11:15.184928
Analysis finished        2023-01-21 11:11:15.161551   2023-01-21 11:11:22.636424
Duration                 9.82 seconds                 7.45 seconds
Software version         pandas-profiling v3.6.2      pandas-profiling v3.6.2
Download configuration   config.json                  config.json
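
The reported durations follow from the start and finish timestamps. A minimal sketch using only the Python standard library; the function name `duration_seconds` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def duration_seconds(started: str, finished: str) -> float:
    """Return the elapsed time between two report timestamps, in seconds."""
    start = datetime.strptime(started, TIMESTAMP_FORMAT)
    end = datetime.strptime(finished, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Timestamps copied from the Reproduction table.
original = duration_seconds("2023-01-21 11:11:05.341314", "2023-01-21 11:11:15.161551")
synthetic = duration_seconds("2023-01-21 11:11:15.184928", "2023-01-21 11:11:22.636424")

print(f"Original Data: {original:.2f} seconds")
print(f"Synthetic Data: {synthetic:.2f} seconds")
```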

Variables

age
Real number (ℝ)

               Original Data   Synthetic Data
Distinct       72              69
Distinct (%)   0.5%            0.5%
Missing        0               0
Missing (%)    0.0%            0.0%
Infinite       0               0
Infinite (%)   0.0%            0.0%
Mean           38.492714       38.523286
Minimum        17              1
Maximum        90              90
Zeros          0               0
Zeros (%)      0.0%            0.0%
Negative       0               0
Negative (%)   0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Quantile statistics

                            Original Data   Synthetic Data
Minimum                     17              1
5-th percentile             20              20
Q1                          27              27
median                      37              38
Q3                          47              48
95-th percentile            63              62
Maximum                     90              90
Range                       73              89
Interquartile range (IQR)   20              21
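
Range and IQR are derived from the other quantile rows: range is maximum minus minimum, and IQR is Q3 minus Q1. A minimal sketch with the age figures; the `Quantiles` class is a hypothetical container introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantiles:
    """Selected quantile statistics of one numeric variable."""
    minimum: float
    q1: float
    q3: float
    maximum: float

    @property
    def value_range(self) -> float:
        # Range = Maximum - Minimum
        return self.maximum - self.minimum

    @property
    def iqr(self) -> float:
        # Interquartile range = Q3 - Q1
        return self.q3 - self.q1


# Values copied from the age quantile statistics.
original = Quantiles(minimum=17, q1=27, q3=47, maximum=90)
synthetic = Quantiles(minimum=1, q1=27, q3=48, maximum=90)

print(original.value_range, original.iqr)
print(synthetic.value_range, synthetic.iqr)
```

The single synthetic record with age 1 is what widens the synthetic range from 73 to 89 while leaving the IQR almost unchanged.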

Descriptive statistics

                                  Original Data   Synthetic Data
Standard deviation                13.684022       13.439151
Coefficient of variation (CV)     0.35549642      0.34885787
Kurtosis                          -0.098493391    -0.30808631
Mean                              38.492714       38.523286
Median Absolute Deviation (MAD)   10              10
Skewness                          0.58416581      0.4923188
Sum                               538898          539326
Variance                          187.25246       180.61079
Monotonicity                      Not monotonic   Not monotonic
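
Several descriptive statistics are tied together: the coefficient of variation is the standard deviation divided by the mean, and the variance is the square of the standard deviation. The sketch below checks the reported age figures against those identities; the function name `is_consistent` and the tolerance are assumptions chosen for illustration.

```python
import math

# Figures copied from the age descriptive statistics.
REPORTED = {
    "original": {"mean": 38.492714, "std": 13.684022, "cv": 0.35549642, "variance": 187.25246},
    "synthetic": {"mean": 38.523286, "std": 13.439151, "cv": 0.34885787, "variance": 180.61079},
}


def is_consistent(stats: dict, rel_tol: float = 1e-6) -> bool:
    """Check CV = std / mean and variance = std**2 within a relative tolerance."""
    cv_ok = math.isclose(stats["std"] / stats["mean"], stats["cv"], rel_tol=rel_tol)
    var_ok = math.isclose(stats["std"] ** 2, stats["variance"], rel_tol=rel_tol)
    return cv_ok and var_ok


for name, stats in REPORTED.items():
    print(name, is_consistent(stats))
```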
Histogram with fixed size bins (bins=50)
Original Data
Value               Count   Frequency (%)
36                  381     2.7%
23                  381     2.7%
30                  379     2.7%
34                  373     2.7%
31                  372     2.7%
28                  369     2.6%
41                  368     2.6%
37                  367     2.6%
24                  366     2.6%
25                  364     2.6%
Other values (62)   10280   73.4%

Synthetic Data
Value               Count   Frequency (%)
28                  515     3.7%
25                  479     3.4%
24                  466     3.3%
40                  449     3.2%
41                  440     3.1%
48                  421     3.0%
45                  412     2.9%
21                  406     2.9%
23                  398     2.8%
46                  397     2.8%
Other values (59)   9617    68.7%

Original Data, smallest values
Value   Count   Frequency (%)
17      184     1.3%
18      237     1.7%
19      273     1.9%
20      341     2.4%
21      326     2.3%
22      341     2.4%
23      381     2.7%
24      366     2.6%
25      364     2.6%
26      334     2.4%

Synthetic Data, smallest values
Value   Count   Frequency (%)
1       1       < 0.1%
17      134     1.0%
18      206     1.5%
19      219     1.6%
20      360     2.6%
21      406     2.9%
22      262     1.9%
23      398     2.8%
24      466     3.3%
25      479     3.4%

workclass
Categorical

               Original Data   Synthetic Data
Distinct       8               8
Distinct (%)   0.1%            0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Original Data
Value              Count
Private            9743
Self-emp-not-inc   1116
Local-gov          887
?                  810
State-gov          555
Other values (3)   889

Synthetic Data
Value              Count
Private            10258
?                  884
Self-emp-not-inc   857
Local-gov          759
Self-emp-inc       422
Other values (3)   820

Length

                Original Data   Synthetic Data
Max length      16              16
Median length   7               7
Mean length     7.8645714       7.6055
Min length      1               1

Characters and Unicode

                      Original Data   Synthetic Data
Total characters      110104          106477
Distinct characters   25              25
Distinct categories   4               4
Distinct scripts      2               2
Distinct blocks       1               1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
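
As an illustration of such character properties, the sketch below uses Python's standard `unicodedata` module to count Unicode general categories across the workclass values. The value counts are copied from the Original Data column of the workclass Common Values table; the function name `category_counts` is hypothetical, and this is not necessarily how the profiling tool computes its figures.

```python
import unicodedata
from collections import Counter

# Value counts copied from the workclass Common Values table (Original Data).
WORKCLASS_COUNTS = {
    "Private": 9743,
    "Self-emp-not-inc": 1116,
    "Local-gov": 887,
    "?": 810,
    "State-gov": 555,
    "Self-emp-inc": 480,
    "Federal-gov": 399,
    "Without-pay": 10,
}


def category_counts(value_counts: dict) -> Counter:
    """Count characters per Unicode general category, weighted by value frequency."""
    totals = Counter()
    for value, n in value_counts.items():
        for ch in value:
            # e.g. "Ll" = Lowercase Letter, "Lu" = Uppercase Letter,
            # "Pd" = Dash Punctuation, "Po" = Other Punctuation
            totals[unicodedata.category(ch)] += n
    return totals


counts = category_counts(WORKCLASS_COUNTS)
print(counts)
print(sum(counts.values()))  # total characters
```

The four categories and their totals match the Original Data figures in the workclass "Most occurring categories" table, and the grand total matches the reported total characters.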

Unique

             Original Data   Synthetic Data
Unique       0               0
Unique (%)   0.0%            0.0%

Sample

          Original Data   Synthetic Data
1st row   ?               ?
2nd row   Private         Private
3rd row   Private         Private
4th row   Private         Private
5th row   ?               Private

Common Values

Original Data
Value              Count   Frequency (%)
Private            9743    69.6%
Self-emp-not-inc   1116    8.0%
Local-gov          887     6.3%
?                  810     5.8%
State-gov          555     4.0%
Self-emp-inc       480     3.4%
Federal-gov        399     2.9%
Without-pay        10      0.1%

Synthetic Data
Value              Count   Frequency (%)
Private            10258   73.3%
?                  884     6.3%
Self-emp-not-inc   857     6.1%
Local-gov          759     5.4%
Self-emp-inc       422     3.0%
State-gov          420     3.0%
Federal-gov        396     2.8%
Without-pay        4       < 0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Original Data

[Figure: common values plot]

Synthetic Data

[Figure: common values plot]
ValueCountFrequency (%)
private 9743
69.6%
self-emp-not-inc 1116
 
8.0%
local-gov 887
 
6.3%
810
 
5.8%
state-gov 555
 
4.0%
self-emp-inc 480
 
3.4%
federal-gov 399
 
2.9%
without-pay 10
 
0.1%
ValueCountFrequency (%)
private 10258
73.3%
884
 
6.3%
self-emp-not-inc 857
 
6.1%
local-gov 759
 
5.4%
self-emp-inc 422
 
3.0%
state-gov 420
 
3.0%
federal-gov 396
 
2.8%
without-pay 4
 
< 0.1%

Most occurring characters

ValueCountFrequency (%)
e 14288
13.0%
t 11989
10.9%
a 11594
10.5%
v 11584
10.5%
i 11349
10.3%
r 10142
9.2%
P 9743
8.8%
- 6159
 
5.6%
o 3854
 
3.5%
l 2882
 
2.6%
Other values (15) 16520
15.0%
ValueCountFrequency (%)
e 14028
13.2%
t 11963
11.2%
a 11837
11.1%
v 11833
11.1%
i 11541
10.8%
r 10654
10.0%
P 10258
9.6%
- 4994
 
4.7%
o 3195
 
3.0%
l 2434
 
2.3%
Other values (15) 13740
12.9%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 89945
81.7%
Uppercase Letter 13190
 
12.0%
Dash Punctuation 6159
 
5.6%
Other Punctuation 810
 
0.7%
ValueCountFrequency (%)
Lowercase Letter 87483
82.2%
Uppercase Letter 13116
 
12.3%
Dash Punctuation 4994
 
4.7%
Other Punctuation 884
 
0.8%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e 14288
15.9%
t 11989
13.3%
a 11594
12.9%
v 11584
12.9%
i 11349
12.6%
r 10142
11.3%
o 3854
 
4.3%
l 2882
 
3.2%
n 2712
 
3.0%
c 2483
 
2.8%
Other values (8) 7068
7.9%
ValueCountFrequency (%)
e 14028
16.0%
t 11963
13.7%
a 11837
13.5%
v 11833
13.5%
i 11541
13.2%
r 10654
12.2%
o 3195
 
3.7%
l 2434
 
2.8%
n 2136
 
2.4%
c 2038
 
2.3%
Other values (8) 5824
6.7%
Uppercase Letter
ValueCountFrequency (%)
P 9743
73.9%
S 2151
 
16.3%
L 887
 
6.7%
F 399
 
3.0%
W 10
 
0.1%
ValueCountFrequency (%)
P 10258
78.2%
S 1699
 
13.0%
L 759
 
5.8%
F 396
 
3.0%
W 4
 
< 0.1%
Dash Punctuation
ValueCountFrequency (%)
- 6159
100.0%
ValueCountFrequency (%)
- 4994
100.0%
Other Punctuation
ValueCountFrequency (%)
? 810
100.0%
ValueCountFrequency (%)
? 884
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin 103135
93.7%
Common 6969
 
6.3%
ValueCountFrequency (%)
Latin 100599
94.5%
Common 5878
 
5.5%

Most frequent character per script

Latin
ValueCountFrequency (%)
e 14288
13.9%
t 11989
11.6%
a 11594
11.2%
v 11584
11.2%
i 11349
11.0%
r 10142
9.8%
P 9743
9.4%
o 3854
 
3.7%
l 2882
 
2.8%
n 2712
 
2.6%
Other values (13) 12998
12.6%
ValueCountFrequency (%)
e 14028
13.9%
t 11963
11.9%
a 11837
11.8%
v 11833
11.8%
i 11541
11.5%
r 10654
10.6%
P 10258
10.2%
o 3195
 
3.2%
l 2434
 
2.4%
n 2136
 
2.1%
Other values (13) 10720
10.7%
Common
ValueCountFrequency (%)
- 6159
88.4%
? 810
 
11.6%
ValueCountFrequency (%)
- 4994
85.0%
? 884
 
15.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII 110104
100.0%
ValueCountFrequency (%)
ASCII 106477
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e 14288
13.0%
t 11989
10.9%
a 11594
10.5%
v 11584
10.5%
i 11349
10.3%
r 10142
9.2%
P 9743
8.8%
- 6159
 
5.6%
o 3854
 
3.5%
l 2882
 
2.6%
Other values (15) 16520
15.0%
ValueCountFrequency (%)
e 14028
13.2%
t 11963
11.2%
a 11837
11.1%
v 11833
11.1%
i 11541
10.8%
r 10654
10.0%
P 10258
9.6%
- 4994
 
4.7%
o 3195
 
3.0%
l 2434
 
2.3%
Other values (15) 13740
12.9%

fnlwgt
Real number (ℝ)

               Original Data   Synthetic Data
Distinct       11258           7359
Distinct (%)   80.4%           52.6%
Missing        0               4
Missing (%)    0.0%            < 0.1%
Infinite       0               0
Infinite (%)   0.0%            0.0%
Mean           189421.81       193648.07
Minimum        12285           4
Maximum        1484705         15129410
Zeros          0               0
Zeros (%)      0.0%            0.0%
Negative       0               0
Negative (%)   0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Quantile statistics

                            Original Data   Synthetic Data
Minimum                     12285           4
5-th percentile             39235.35        38772
Q1                          118095.5        116632
median                      179533          178882.5
Q3                          236858.75       234460
95-th percentile            377759.05       377930.25
Maximum                     1484705         15129410
Range                       1472420         15129406
Interquartile range (IQR)   118763.25       117828

Descriptive statistics

                                  Original Data       Synthetic Data
Standard deviation                104509.84           203944.58
Coefficient of variation (CV)     0.55173077          1.0531712
Kurtosis                          6.028042            1347.0835
Mean                              189421.81           193648.07
Median Absolute Deviation (MAD)   59954.5             59719.5
Skewness                          1.3977123           33.690877
Sum                               2.6519054 × 10^9    2.7102984 × 10^9
Variance                          1.0922307 × 10^10   4.159339 × 10^10
Monotonicity                      Not monotonic       Not monotonic
Histogram with fixed size bins (bins=50)
Original Data
Value                  Count   Frequency (%)
177675                 8       0.1%
113364                 8       0.1%
125461                 7       0.1%
190290                 7       0.1%
193882                 7       0.1%
126675                 7       0.1%
194901                 7       0.1%
121124                 7       0.1%
117963                 7       0.1%
143582                 6       < 0.1%
Other values (11248)   13929   99.5%

Synthetic Data
Value                  Count   Frequency (%)
148995                 21      0.1%
104196                 19      0.1%
32732                  18      0.1%
190290                 18      0.1%
143582                 17      0.1%
99185                  17      0.1%
417668                 17      0.1%
272944                 16      0.1%
144778                 16      0.1%
193882                 15      0.1%
Other values (7349)    13822   98.7%

Original Data, smallest values
Value   Count   Frequency (%)
12285   1       < 0.1%
14878   1       < 0.1%
19214   1       < 0.1%
19302   3       < 0.1%
19395   2       < 0.1%
19410   1       < 0.1%
19700   1       < 0.1%
19752   1       < 0.1%
19793   1       < 0.1%
19899   1       < 0.1%

Synthetic Data, smallest values
Value   Count   Frequency (%)
4       1       < 0.1%
108     1       < 0.1%
828     1       < 0.1%
2013    1       < 0.1%
3125    1       < 0.1%
3413    1       < 0.1%
3487    1       < 0.1%
3788    1       < 0.1%
3908    1       < 0.1%
3911    1       < 0.1%

education
Categorical

               Original Data   Synthetic Data
Distinct       16              16
Distinct (%)   0.1%            0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Original Data
Value               Count
HS-grad             4452
Some-college        3163
Bachelors           2319
Masters             734
Assoc-voc           624
Other values (11)   2708

Synthetic Data
Value               Count
HS-grad             4650
Some-college        3020
Bachelors           2522
Assoc-voc           707
Masters             650
Other values (11)   2451

Length

                Original Data   Synthetic Data
Max length      12              12
Median length   11              11
Mean length     8.439           8.4553571
Min length      3               3

Characters and Unicode

                      Original Data   Synthetic Data
Total characters      118146          118375
Distinct characters   31              31
Distinct categories   4               4
Distinct scripts      2               2
Distinct blocks       1               1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

             Original Data   Synthetic Data
Unique       0               0
Unique (%)   0.0%            0.0%

Sample

          Original Data   Synthetic Data
1st row   11th            11th
2nd row   Doctorate       10th
3rd row   Bachelors       9th
4th row   Assoc-acdm      5th-6th
5th row   Some-college    HS-grad

Common Values

Original Data
Value              Count   Frequency (%)
HS-grad            4452    31.8%
Some-college       3163    22.6%
Bachelors          2319    16.6%
Masters            734     5.2%
Assoc-voc          624     4.5%
11th               528     3.8%
Assoc-acdm         452     3.2%
10th               408     2.9%
7th-8th            297     2.1%
Prof-school        233     1.7%
Other values (6)   790     5.6%

Synthetic Data
Value              Count   Frequency (%)
HS-grad            4650    33.2%
Some-college       3020    21.6%
Bachelors          2522    18.0%
Assoc-voc          707     5.1%
Masters            650     4.6%
11th               470     3.4%
Assoc-acdm         407     2.9%
10th               355     2.5%
7th-8th            299     2.1%
Prof-school        202     1.4%
Other values (6)   718     5.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Original Data


Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

Synthetic Data


Number of variable categories passes threshold (config.plot.cat_freq.max_unique)
ValueCountFrequency (%)
hs-grad 4452
31.8%
some-college 3163
22.6%
bachelors 2319
16.6%
masters 734
 
5.2%
assoc-voc 624
 
4.5%
11th 528
 
3.8%
assoc-acdm 452
 
3.2%
10th 408
 
2.9%
7th-8th 297
 
2.1%
prof-school 233
 
1.7%
Other values (6) 790
 
5.6%
ValueCountFrequency (%)
hs-grad 4650
33.2%
some-college 3020
21.6%
bachelors 2522
18.0%
assoc-voc 707
 
5.1%
masters 650
 
4.6%
11th 470
 
3.4%
assoc-acdm 407
 
2.9%
10th 355
 
2.5%
7th-8th 299
 
2.1%
prof-school 202
 
1.4%
Other values (6) 718
 
5.1%

Most occurring characters

ValueCountFrequency (%)
e 12723
10.8%
o 11406
 
9.7%
- 9434
 
8.0%
l 8894
 
7.5%
a 8122
 
6.9%
c 8048
 
6.8%
r 7919
 
6.7%
g 7615
 
6.4%
S 7615
 
6.4%
s 6256
 
5.3%
Other values (21) 30114
25.5%
ValueCountFrequency (%)
e 12421
10.5%
o 11367
 
9.6%
- 9495
 
8.0%
l 8783
 
7.4%
a 8399
 
7.1%
r 8213
 
6.9%
c 8161
 
6.9%
g 7670
 
6.5%
S 7670
 
6.5%
s 6329
 
5.3%
Other values (21) 29867
25.2%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 88627
75.0%
Uppercase Letter 16610
 
14.1%
Dash Punctuation 9434
 
8.0%
Decimal Number 3475
 
2.9%
ValueCountFrequency (%)
Lowercase Letter 88735
75.0%
Uppercase Letter 16997
 
14.4%
Dash Punctuation 9495
 
8.0%
Decimal Number 3148
 
2.7%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e 12723
14.4%
o 11406
12.9%
l 8894
10.0%
a 8122
9.2%
c 8048
9.1%
r 7919
8.9%
g 7615
8.6%
s 6256
7.1%
d 4904
 
5.5%
h 4852
 
5.5%
Other values (4) 7888
8.9%
ValueCountFrequency (%)
e 12421
14.0%
o 11367
12.8%
l 8783
9.9%
a 8399
9.5%
r 8213
9.3%
c 8161
9.2%
g 7670
8.6%
s 6329
7.1%
d 5057
5.7%
h 4847
 
5.5%
Other values (4) 7488
8.4%
Dash Punctuation
ValueCountFrequency (%)
- 9434
100.0%
ValueCountFrequency (%)
- 9495
100.0%
Uppercase Letter
ValueCountFrequency (%)
S 7615
45.8%
H 4452
26.8%
B 2319
 
14.0%
A 1076
 
6.5%
M 734
 
4.4%
P 249
 
1.5%
D 165
 
1.0%
ValueCountFrequency (%)
S 7670
45.1%
H 4650
27.4%
B 2522
 
14.8%
A 1114
 
6.6%
M 650
 
3.8%
P 221
 
1.3%
D 170
 
1.0%
Decimal Number
ValueCountFrequency (%)
1 1719
49.5%
0 408
 
11.7%
7 297
 
8.5%
8 297
 
8.5%
9 209
 
6.0%
2 187
 
5.4%
5 145
 
4.2%
6 145
 
4.2%
4 68
 
2.0%
ValueCountFrequency (%)
1 1514
48.1%
0 355
 
11.3%
7 299
 
9.5%
8 299
 
9.5%
2 161
 
5.1%
9 158
 
5.0%
5 152
 
4.8%
6 152
 
4.8%
4 58
 
1.8%

Most occurring scripts

ValueCountFrequency (%)
Latin 105237
89.1%
Common 12909
 
10.9%
ValueCountFrequency (%)
Latin 105732
89.3%
Common 12643
 
10.7%

Most frequent character per script

Latin
ValueCountFrequency (%)
e 12723
12.1%
o 11406
10.8%
l 8894
8.5%
a 8122
 
7.7%
c 8048
 
7.6%
r 7919
 
7.5%
g 7615
 
7.2%
S 7615
 
7.2%
s 6256
 
5.9%
d 4904
 
4.7%
Other values (11) 21735
20.7%
ValueCountFrequency (%)
e 12421
11.7%
o 11367
10.8%
l 8783
8.3%
a 8399
 
7.9%
r 8213
 
7.8%
c 8161
 
7.7%
g 7670
 
7.3%
S 7670
 
7.3%
s 6329
 
6.0%
d 5057
 
4.8%
Other values (11) 21662
20.5%
Common
ValueCountFrequency (%)
- 9434
73.1%
1 1719
 
13.3%
0 408
 
3.2%
7 297
 
2.3%
8 297
 
2.3%
9 209
 
1.6%
2 187
 
1.4%
5 145
 
1.1%
6 145
 
1.1%
4 68
 
0.5%
ValueCountFrequency (%)
- 9495
75.1%
1 1514
 
12.0%
0 355
 
2.8%
7 299
 
2.4%
8 299
 
2.4%
2 161
 
1.3%
9 158
 
1.2%
5 152
 
1.2%
6 152
 
1.2%
4 58
 
0.5%

Most occurring blocks

ValueCountFrequency (%)
ASCII 118146
100.0%
ValueCountFrequency (%)
ASCII 118375
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e 12723
10.8%
o 11406
 
9.7%
- 9434
 
8.0%
l 8894
 
7.5%
a 8122
 
6.9%
c 8048
 
6.8%
r 7919
 
6.7%
g 7615
 
6.4%
S 7615
 
6.4%
s 6256
 
5.3%
Other values (21) 30114
25.5%
ValueCountFrequency (%)
e 12421
10.5%
o 11367
 
9.6%
- 9495
 
8.0%
l 8783
 
7.4%
a 8399
 
7.1%
r 8213
 
6.9%
c 8161
 
6.9%
g 7670
 
6.5%
S 7670
 
6.5%
s 6329
 
5.3%
Other values (21) 29867
25.2%

education_num
Real number (ℝ)

               Original Data   Synthetic Data
Distinct       16              16
Distinct (%)   0.1%            0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Infinite       0               0
Infinite (%)   0.0%            0.0%
Mean           10.071714       10.116571
Minimum        1               1
Maximum        16              16
Zeros          0               0
Zeros (%)      0.0%            0.0%
Negative       0               0
Negative (%)   0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Quantile statistics

                            Original Data   Synthetic Data
Minimum                     1               1
5-th percentile             5               6
Q1                          9               9
median                      10              10
Q3                          12              13
95-th percentile            14              14
Maximum                     16              16
Range                       15              15
Interquartile range (IQR)   3               4

Descriptive statistics

                                  Original Data   Synthetic Data
Standard deviation                2.5620383       2.521849
Coefficient of variation (CV)     0.25437956      0.24927902
Kurtosis                          0.58744484      0.71457805
Mean                              10.071714       10.116571
Median Absolute Deviation (MAD)   1               1
Skewness                          -0.31432088     -0.34253642
Sum                               141004          141632
Variance                          6.5640402       6.3597225
Monotonicity                      Not monotonic   Not monotonic
Histogram with fixed size bins (bins=16)
Original Data
Value              Count   Frequency (%)
9                  4452    31.8%
10                 3163    22.6%
13                 2319    16.6%
14                 734     5.2%
11                 624     4.5%
7                  528     3.8%
12                 452     3.2%
6                  408     2.9%
4                  297     2.1%
15                 233     1.7%
Other values (6)   790     5.6%

Synthetic Data
Value              Count   Frequency (%)
9                  4650    33.2%
10                 3020    21.6%
13                 2522    18.0%
11                 707     5.1%
14                 650     4.6%
7                  470     3.4%
12                 407     2.9%
6                  355     2.5%
4                  299     2.1%
15                 202     1.4%
Other values (6)   718     5.1%

Original Data, smallest values
Value   Count   Frequency (%)
1       16      0.1%
2       68      0.5%
3       145     1.0%
4       297     2.1%
5       209     1.5%
6       408     2.9%
7       528     3.8%
8       187     1.3%
9       4452    31.8%
10      3163    22.6%

Synthetic Data, smallest values
Value   Count   Frequency (%)
1       19      0.1%
2       58      0.4%
3       152     1.1%
4       299     2.1%
5       158     1.1%
6       355     2.5%
7       470     3.4%
8       161     1.1%
9       4650    33.2%
10      3020    21.6%

marital_status
Categorical

               Original Data   Synthetic Data
Distinct       7               9
Distinct (%)   0.1%            0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Original Data
Value                Count
Married-civ-spouse   6417
Never-married        4661
Divorced             1880
Separated            439
Widowed              426
Other values (2)     177

Synthetic Data
Value                Count
Married-civ-spouse   6269
Never-married        4932
Divorced             2023
Separated            387
Widowed              368
Other values (4)     21

Length

                Original Data   Synthetic Data
Max length      21              21
Median length   18              18
Mean length     14.412071       14.258786
Min length      7               7

Characters and Unicode

                      Original Data   Synthetic Data
Total characters      201769          199623
Distinct characters   24              24
Distinct categories   3               3
Distinct scripts      2               2
Distinct blocks       1               1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

             Original Data   Synthetic Data
Unique       0               3
Unique (%)   0.0%            < 0.1%

Sample

          Original Data   Synthetic Data
1st row   Never-married   Married-civ-spouse
2nd row   Divorced        Married-civ-spouse
3rd row   Never-married   Married-civ-spouse
4th row   Divorced        Married-civ-spouse
5th row   Never-married   Never-married

Common Values

Original Data
Value                   Count   Frequency (%)
Married-civ-spouse      6417    45.8%
Never-married           4661    33.3%
Divorced                1880    13.4%
Separated               439     3.1%
Widowed                 426     3.0%
Married-spouse-absent   172     1.2%
Married-AF-spouse       5       < 0.1%

Synthetic Data
Value                   Count   Frequency (%)
Married-civ-spouse      6269    44.8%
Never-married           4932    35.2%
Divorced                2023    14.4%
Separated               387     2.8%
Widowed                 368     2.6%
Married-spouse-absent   18      0.1%
rried-civ-spouse        1       < 0.1%
-civ-spouse             1       < 0.1%
Married-AF-spouse       1       < 0.1%
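
The synthetic marital_status column contains values such as "rried-civ-spouse" and "-civ-spouse" that do not occur in the original data. A minimal sketch of one way to flag such values is shown below, using the counts from the Common Values table; the function name `unexpected_categories` is hypothetical and not part of the profiling tool.

```python
# Categories observed in the original marital_status column.
ORIGINAL_CATEGORIES = {
    "Married-civ-spouse", "Never-married", "Divorced", "Separated",
    "Widowed", "Married-spouse-absent", "Married-AF-spouse",
}

# Value counts copied from the synthetic marital_status Common Values table.
SYNTHETIC_COUNTS = {
    "Married-civ-spouse": 6269,
    "Never-married": 4932,
    "Divorced": 2023,
    "Separated": 387,
    "Widowed": 368,
    "Married-spouse-absent": 18,
    "rried-civ-spouse": 1,
    "-civ-spouse": 1,
    "Married-AF-spouse": 1,
}


def unexpected_categories(reference: set, counts: dict) -> dict:
    """Return the values (with counts) that are absent from the reference set."""
    return {value: n for value, n in counts.items() if value not in reference}


flagged = unexpected_categories(ORIGINAL_CATEGORIES, SYNTHETIC_COUNTS)
print(flagged)
```

Two of the three unique synthetic values are flagged this way; the third unique value, Married-AF-spouse, is a valid category that happens to occur only once.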

Length

Histogram of lengths of the category

Common Values (Plot)

Original Data

[Figure: common values plot]

Synthetic Data

[Figure: common values plot]
ValueCountFrequency (%)
married-civ-spouse 6417
45.8%
never-married 4661
33.3%
divorced 1880
 
13.4%
separated 439
 
3.1%
widowed 426
 
3.0%
married-spouse-absent 172
 
1.2%
married-af-spouse 5
 
< 0.1%
ValueCountFrequency (%)
married-civ-spouse 6269
44.8%
never-married 4932
35.2%
divorced 2023
 
14.4%
separated 387
 
2.8%
widowed 368
 
2.6%
married-spouse-absent 18
 
0.1%
rried-civ-spouse 1
 
< 0.1%
civ-spouse 1
 
< 0.1%
married-af-spouse 1
 
< 0.1%

Most occurring characters

ValueCountFrequency (%)
e 30527
15.1%
r 29490
14.6%
i 19978
9.9%
- 17849
8.8%
d 14426
7.1%
s 13360
 
6.6%
v 12958
 
6.4%
a 12305
 
6.1%
o 8900
 
4.4%
c 8297
 
4.1%
Other values (14) 33679
16.7%
ValueCountFrequency (%)
e 30558
15.3%
r 29784
14.9%
i 19883
10.0%
- 17512
8.8%
d 14367
7.2%
v 13226
6.6%
s 12598
 
6.3%
a 12012
 
6.0%
o 8681
 
4.3%
c 8294
 
4.2%
Other values (14) 32708
16.4%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 169910
84.2%
Dash Punctuation 17849
 
8.8%
Uppercase Letter 14010
 
6.9%
ValueCountFrequency (%)
Lowercase Letter 168111
84.2%
Dash Punctuation 17512
 
8.8%
Uppercase Letter 14000
 
7.0%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e 30527
18.0%
r 29490
17.4%
i 19978
11.8%
d 14426
8.5%
s 13360
7.9%
v 12958
7.6%
a 12305
7.2%
o 8900
 
5.2%
c 8297
 
4.9%
p 7033
 
4.1%
Other values (6) 12636
7.4%
ValueCountFrequency (%)
e 30558
18.2%
r 29784
17.7%
i 19883
11.8%
d 14367
8.5%
v 13226
7.9%
s 12598
7.5%
a 12012
 
7.1%
o 8681
 
5.2%
c 8294
 
4.9%
p 6677
 
4.0%
Other values (6) 12031
 
7.2%
Dash Punctuation
ValueCountFrequency (%)
- 17849
100.0%
ValueCountFrequency (%)
- 17512
100.0%
Uppercase Letter
ValueCountFrequency (%)
M 6594
47.1%
N 4661
33.3%
D 1880
 
13.4%
S 439
 
3.1%
W 426
 
3.0%
A 5
 
< 0.1%
F 5
 
< 0.1%
ValueCountFrequency (%)
M 6288
44.9%
N 4932
35.2%
D 2023
 
14.4%
S 387
 
2.8%
W 368
 
2.6%
A 1
 
< 0.1%
F 1
 
< 0.1%

Most occurring scripts

ValueCountFrequency (%)
Latin 183920
91.2%
Common 17849
 
8.8%
ValueCountFrequency (%)
Latin 182111
91.2%
Common 17512
 
8.8%

Most frequent character per script

Latin
ValueCountFrequency (%)
e 30527
16.6%
r 29490
16.0%
i 19978
10.9%
d 14426
7.8%
s 13360
7.3%
v 12958
7.0%
a 12305
6.7%
o 8900
 
4.8%
c 8297
 
4.5%
p 7033
 
3.8%
Other values (13) 26646
14.5%
ValueCountFrequency (%)
e 30558
16.8%
r 29784
16.4%
i 19883
10.9%
d 14367
7.9%
v 13226
7.3%
s 12598
6.9%
a 12012
 
6.6%
o 8681
 
4.8%
c 8294
 
4.6%
p 6677
 
3.7%
Other values (13) 26031
14.3%
Common
ValueCountFrequency (%)
- 17849
100.0%
ValueCountFrequency (%)
- 17512
100.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII 201769
100.0%
ValueCountFrequency (%)
ASCII 199623
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e 30527
15.1%
r 29490
14.6%
i 19978
9.9%
- 17849
8.8%
d 14426
7.1%
s 13360
 
6.6%
v 12958
 
6.4%
a 12305
 
6.1%
o 8900
 
4.4%
c 8297
 
4.1%
Other values (14) 33679
16.7%
ValueCountFrequency (%)
e 30558
15.3%
r 29784
14.9%
i 19883
10.0%
- 17512
8.8%
d 14367
7.2%
v 13226
6.6%
s 12598
 
6.3%
a 12012
 
6.0%
o 8681
 
4.3%
c 8294
 
4.2%
Other values (14) 32708
16.4%

occupation
Categorical

               Original Data   Synthetic Data
Distinct       15              19
Distinct (%)   0.1%            0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Original Data
Value               Count
Craft-repair        1766
Exec-managerial     1753
Prof-specialty      1749
Adm-clerical        1620
Sales               1575
Other values (10)   5537

Synthetic Data
Value               Count
Adm-clerical        2310
Craft-repair        2161
Prof-specialty      1780
Exec-managerial     1598
Other-service       1277
Other values (14)   4874

Length

                Original Data   Synthetic Data
Max length      17              17
Median length   15              15
Mean length     12.188          12.230929
Min length      1               1

Characters and Unicode

                      Original Data   Synthetic Data
Total characters      170632          171233
Distinct characters   32              32
Distinct categories   4               4
Distinct scripts      2               2
Distinct blocks       1               1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

             Original Data   Synthetic Data
Unique       0               4
Unique (%)   0.0%            < 0.1%

Sample

          Original Data   Synthetic Data
1st row   ?               ?
2nd row   Sales           ?
3rd row   Craft-repair    ?
4th row   Sales           Sales
5th row   ?               ?

Common Values

Original Data
Value               Count   Frequency (%)
Craft-repair        1766    12.6%
Exec-managerial     1753    12.5%
Prof-specialty      1749    12.5%
Adm-clerical        1620    11.6%
Sales               1575    11.2%
Other-service       1416    10.1%
Machine-op-inspct   875     6.2%
?                   810     5.8%
Transport-moving    694     5.0%
Handlers-cleaners   580     4.1%
Other values (5)    1162    8.3%

Synthetic Data
Value               Count   Frequency (%)
Adm-clerical        2310    16.5%
Craft-repair        2161    15.4%
Prof-specialty      1780    12.7%
Exec-managerial     1598    11.4%
Other-service       1277    9.1%
Sales               1244    8.9%
Transport-moving    884     6.3%
?                   765     5.5%
Machine-op-inspct   602     4.3%
Handlers-cleaners   473     3.4%
Other values (9)    906     6.5%

Length

Histogram of lengths of the category

Common Values (Plot)

Original Data


Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

Synthetic Data


Number of variable categories passes threshold (config.plot.cat_freq.max_unique)
ValueCountFrequency (%)
craft-repair 1766
12.6%
exec-managerial 1753
12.5%
prof-specialty 1749
12.5%
adm-clerical 1620
11.6%
sales 1575
11.2%
other-service 1416
10.1%
machine-op-inspct 875
6.2%
810
5.8%
transport-moving 694
 
5.0%
handlers-cleaners 580
 
4.1%
Other values (5) 1162
8.3%
ValueCountFrequency (%)
adm-clerical 2310
16.5%
craft-repair 2161
15.4%
prof-specialty 1780
12.7%
exec-managerial 1598
11.4%
other-service 1277
9.1%
sales 1244
8.9%
transport-moving 884
 
6.3%
765
 
5.5%
machine-op-inspct 602
 
4.3%
handlers-cleaners 473
 
3.4%
Other values (9) 906
 
6.5%

Most occurring characters

ValueCountFrequency (%)
e 18492
 
10.8%
r 17334
 
10.2%
a 16877
 
9.9%
- 12566
 
7.4%
i 12355
 
7.2%
c 11161
 
6.5%
l 9477
 
5.6%
s 8707
 
5.1%
t 7461
 
4.4%
n 6877
 
4.0%
Other values (22) 49325
28.9%
ValueCountFrequency (%)
r 18533
 
10.8%
e 17354
 
10.1%
a 17294
 
10.1%
- 12684
 
7.4%
i 12654
 
7.4%
c 11363
 
6.6%
l 10190
 
6.0%
s 7722
 
4.5%
t 7230
 
4.2%
p 6617
 
3.9%
Other values (22) 49592
29.0%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 144062
84.4%
Uppercase Letter 13194
 
7.7%
Dash Punctuation 12566
 
7.4%
Other Punctuation 810
 
0.5%
ValueCountFrequency (%)
Lowercase Letter 144550
84.4%
Uppercase Letter 13234
 
7.7%
Dash Punctuation 12684
 
7.4%
Other Punctuation 765
 
0.4%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e 18492
12.8%
r 17334
12.0%
a 16877
11.7%
i 12355
8.6%
c 11161
 
7.7%
l 9477
 
6.6%
s 8707
 
6.0%
t 7461
 
5.2%
n 6877
 
4.8%
p 6713
 
4.7%
Other values (10) 28608
19.9%
ValueCountFrequency (%)
r 18533
12.8%
e 17354
12.0%
a 17294
12.0%
i 12654
8.8%
c 11363
 
7.9%
l 10190
 
7.0%
s 7722
 
5.3%
t 7230
 
5.0%
p 6617
 
4.6%
n 6345
 
4.4%
Other values (10) 29248
20.2%
Dash Punctuation
ValueCountFrequency (%)
- 12566
100.0%
ValueCountFrequency (%)
- 12684
100.0%
Uppercase Letter
ValueCountFrequency (%)
P 2117
16.0%
C 1766
13.4%
E 1753
13.3%
A 1624
12.3%
S 1575
11.9%
O 1416
10.7%
T 1071
8.1%
M 875
6.6%
H 580
 
4.4%
F 417
 
3.2%
ValueCountFrequency (%)
A 2310
17.5%
C 2161
16.3%
P 1979
15.0%
E 1599
12.1%
O 1277
9.6%
S 1244
9.4%
T 1179
8.9%
M 602
 
4.5%
H 473
 
3.6%
F 410
 
3.1%
Other Punctuation
ValueCountFrequency (%)
? 810
100.0%
ValueCountFrequency (%)
? 765
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin 157256
92.2%
Common 13376
 
7.8%
ValueCountFrequency (%)
Latin 157784
92.1%
Common 13449
 
7.9%

Most frequent character per script

Latin
ValueCountFrequency (%)
e 18492
11.8%
r 17334
11.0%
a 16877
10.7%
i 12355
 
7.9%
c 11161
 
7.1%
l 9477
 
6.0%
s 8707
 
5.5%
t 7461
 
4.7%
n 6877
 
4.4%
p 6713
 
4.3%
Other values (20) 41802
26.6%
ValueCountFrequency (%)
r 18533
11.7%
e 17354
11.0%
a 17294
11.0%
i 12654
 
8.0%
c 11363
 
7.2%
l 10190
 
6.5%
s 7722
 
4.9%
t 7230
 
4.6%
p 6617
 
4.2%
n 6345
 
4.0%
Other values (20) 42482
26.9%
Common
ValueCountFrequency (%)
- 12566
93.9%
? 810
 
6.1%
ValueCountFrequency (%)
- 12684
94.3%
? 765
 
5.7%

Most occurring blocks

ValueCountFrequency (%)
ASCII 170632
100.0%
ValueCountFrequency (%)
ASCII 171233
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e 18492
 
10.8%
r 17334
 
10.2%
a 16877
 
9.9%
- 12566
 
7.4%
i 12355
 
7.2%
c 11161
 
6.5%
l 9477
 
5.6%
s 8707
 
5.1%
t 7461
 
4.4%
n 6877
 
4.0%
Other values (22) 49325
28.9%
ValueCountFrequency (%)
r 18533
 
10.8%
e 17354
 
10.1%
a 17294
 
10.1%
- 12684
 
7.4%
i 12654
 
7.4%
c 11363
 
6.6%
l 10190
 
6.0%
s 7722
 
4.5%
t 7230
 
4.2%
p 6617
 
3.9%
Other values (22) 49592
29.0%

relationship
Categorical

               Original Data   Synthetic Data
Distinct       6               8
Distinct (%)   < 0.1%          0.1%
Missing        0               0
Missing (%)    0.0%            0.0%
Memory size    109.5 KiB       109.5 KiB

Original Data
Value              Count
Husband            5627
Not-in-family      3579
Own-child          2253
Unmarried          1433
Wife               689

Synthetic Data
Value              Count
Husband            5396
Not-in-family      3646
Own-child          2710
Unmarried          1168
Wife               776
Other values (3)   304

Length

                Original Data   Synthetic Data
Max length      14              14
Median length   13              13
Mean length     9.1222857       9.1011429
Min length      4               2

Characters and Unicode

                      Original Data   Synthetic Data
Total characters      127712          127416
Distinct characters   25              25
Distinct categories   3               3
Distinct scripts      2               2
Distinct blocks       1               1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

             Original Data   Synthetic Data
Unique       0               2
Unique (%)   0.0%            < 0.1%

Sample

          Original Data   Synthetic Data
1st row   Unmarried       Husband
2nd row   Not-in-family   Husband
3rd row   Not-in-family   Husband
4th row   Not-in-family   Husband
5th row   Unmarried       Own-child

Common Values

Original Data
Value            Count   Frequency (%)
Husband          5627    40.2%
Not-in-family    3579    25.6%
Own-child        2253    16.1%
Unmarried        1433    10.2%
Wife             689     4.9%
Other-relative   419     3.0%

Synthetic Data
Value            Count   Frequency (%)
Husband          5396    38.5%
Not-in-family    3646    26.0%
Own-child        2710    19.4%
Unmarried        1168    8.3%
Wife             776     5.5%
Other-relative   302     2.2%
ld               1       < 0.1%
Ownmarried       1       < 0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Original Data

[Figure: common values plot]

Synthetic Data

[Figure: common values plot]
ValueCountFrequency (%)
husband 5627
40.2%
not-in-family 3579
25.6%
own-child 2253
16.1%
unmarried 1433
 
10.2%
wife 689
 
4.9%
other-relative 419
 
3.0%
ValueCountFrequency (%)
husband 5396
38.5%
not-in-family 3646
26.0%
own-child 2710
19.4%
unmarried 1168
 
8.3%
wife 776
 
5.5%
other-relative 302
 
2.2%
ld 1
 
< 0.1%
ownmarried 1
 
< 0.1%

Most occurring characters

ValueCountFrequency (%)
n 12892
 
10.1%
i 11952
 
9.4%
a 11058
 
8.7%
- 9830
 
7.7%
d 9313
 
7.3%
l 6251
 
4.9%
H 5627
 
4.4%
u 5627
 
4.4%
s 5627
 
4.4%
b 5627
 
4.4%
Other values (15) 43908
34.4%
ValueCountFrequency (%)
n 12921
 
10.1%
i 12249
 
9.6%
a 10513
 
8.3%
- 10304
 
8.1%
d 9276
 
7.3%
l 6659
 
5.2%
H 5396
 
4.2%
s 5396
 
4.2%
b 5396
 
4.2%
u 5396
 
4.2%
Other values (15) 43910
34.5%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 103882
81.3%
Uppercase Letter 14000
 
11.0%
Dash Punctuation 9830
 
7.7%
ValueCountFrequency (%)
Lowercase Letter 103113
80.9%
Uppercase Letter 13999
 
11.0%
Dash Punctuation 10304
 
8.1%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
n 12892
12.4%
i 11952
11.5%
a 11058
10.6%
d 9313
 
9.0%
l 6251
 
6.0%
u 5627
 
5.4%
s 5627
 
5.4%
b 5627
 
5.4%
m 5012
 
4.8%
t 4417
 
4.3%
Other values (9) 26106
25.1%
ValueCountFrequency (%)
n 12921
12.5%
i 12249
11.9%
a 10513
10.2%
d 9276
 
9.0%
l 6659
 
6.5%
s 5396
 
5.2%
b 5396
 
5.2%
u 5396
 
5.2%
m 4815
 
4.7%
f 4422
 
4.3%
Other values (9) 26070
25.3%
Dash Punctuation
ValueCountFrequency (%)
- 9830
100.0%
ValueCountFrequency (%)
- 10304
100.0%
Uppercase Letter
ValueCountFrequency (%)
H 5627
40.2%
N 3579
25.6%
O 2672
19.1%
U 1433
 
10.2%
W 689
 
4.9%
ValueCountFrequency (%)
H 5396
38.5%
N 3646
26.0%
O 3013
21.5%
U 1168
 
8.3%
W 776
 
5.5%

Most occurring scripts

ValueCountFrequency (%)
Latin 117882
92.3%
Common 9830
 
7.7%
ValueCountFrequency (%)
Latin 117112
91.9%
Common 10304
 
8.1%

Most frequent character per script

Latin
ValueCountFrequency (%)
n 12892
 
10.9%
i 11952
 
10.1%
a 11058
 
9.4%
d 9313
 
7.9%
l 6251
 
5.3%
H 5627
 
4.8%
u 5627
 
4.8%
s 5627
 
4.8%
b 5627
 
4.8%
m 5012
 
4.3%
Other values (14) 38896
33.0%
ValueCountFrequency (%)
n 12921
 
11.0%
i 12249
 
10.5%
a 10513
 
9.0%
d 9276
 
7.9%
l 6659
 
5.7%
H 5396
 
4.6%
s 5396
 
4.6%
b 5396
 
4.6%
u 5396
 
4.6%
m 4815
 
4.1%
Other values (14) 39095
33.4%
Common
ValueCountFrequency (%)
- 9830
100.0%
ValueCountFrequency (%)
- 10304
100.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII 127712
100.0%
ValueCountFrequency (%)
ASCII 127416
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
n 12892
 
10.1%
i 11952
 
9.4%
a 11058
 
8.7%
- 9830
 
7.7%
d 9313
 
7.3%
l 6251
 
4.9%
H 5627
 
4.4%
u 5627
 
4.4%
s 5627
 
4.4%
b 5627
 
4.4%
Other values (15) 43908
34.4%
ValueCountFrequency (%)
n 12921
 
10.1%
i 12249
 
9.6%
a 10513
 
8.3%
- 10304
 
8.1%
d 9276
 
7.3%
l 6659
 
5.2%
H 5396
 
4.2%
s 5396
 
4.2%
b 5396
 
4.2%
u 5396
 
4.2%
Other values (15) 43910
34.5%

race
Categorical

              Original Data  Synthetic Data
Distinct      5              5
Distinct (%)  < 0.1%         < 0.1%
Missing       0              0
Missing (%)   0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Value               Original Data  Synthetic Data
White               11935          12736
Black               1356           738
Asian-Pac-Islander  463            330
Amer-Indian-Eskimo  132            113
Other               114            83

Length

               Original Data  Synthetic Data
Max length     18             18
Median length  5              5
Mean length    5.5525         5.4113571
Min length     5              5

Characters and Unicode

                     Original Data  Synthetic Data
Total characters     77735          75759
Distinct characters  22             22
Distinct categories  3              3
Distinct scripts     2              2
Distinct blocks      1              1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
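
As a rough illustration of the sentence above, Python's standard unicodedata module exposes the general category of each code point. The sketch below tallies categories for the Original Data race values, weighting each value by its reported count. The helper name category_counts is hypothetical; scripts and blocks are left out because the standard library does not expose them.

```python
import unicodedata
from collections import Counter

# Original Data counts for race, as reported in this section.
race_original = {
    "White": 11935, "Black": 1356, "Asian-Pac-Islander": 463,
    "Amer-Indian-Eskimo": 132, "Other": 114,
}


def category_counts(value_counts):
    """Tally Unicode general categories over all characters of all rows."""
    totals = Counter()
    for value, count in value_counts.items():
        for ch in value:
            # e.g. 'Lu' uppercase letter, 'Ll' lowercase letter, 'Pd' dash punctuation
            totals[unicodedata.category(ch)] += count
    return totals


counts = category_counts(race_original)
print(counts)
```

The three categories found (Ll, Lu, Pd) correspond to the Lowercase Letter, Uppercase Letter and Dash Punctuation rows reported for this variable.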

Unique

            Original Data  Synthetic Data
Unique      0              0
Unique (%)  0.0%           0.0%

Sample

         Original Data  Synthetic Data
1st row  White          White
2nd row  White          White
3rd row  White          White
4th row  White          White
5th row  White          White

Common Values

Original Data

Value               Count  Frequency (%)
White               11935  85.2%
Black               1356   9.7%
Asian-Pac-Islander  463    3.3%
Amer-Indian-Eskimo  132    0.9%
Other               114    0.8%

Synthetic Data

Value               Count  Frequency (%)
White               12736  91.0%
Black               738    5.3%
Asian-Pac-Islander  330    2.4%
Amer-Indian-Eskimo  113    0.8%
Other               83     0.6%
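
The Alerts section reports race as highly imbalanced (65.3% for Original Data, 75.2% for Synthetic Data). The report does not state the formula. The sketch below assumes the score is one minus the normalized Shannon entropy of the category shares, which reproduces both figures from the counts above. The function name imbalance_score is hypothetical.

```python
import math


def imbalance_score(counts):
    """1 - H/log(k): 0 for a uniform column, close to 1 when one class dominates."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1 - entropy / math.log(len(counts))


race_original = [11935, 1356, 463, 132, 114]
race_synthetic = [12736, 738, 330, 113, 83]

score_original = imbalance_score(race_original)
score_synthetic = imbalance_score(race_synthetic)
print(f"{score_original:.1%}  {score_synthetic:.1%}")  # 65.3%  75.2%
```

Under this assumption the synthetic column is more concentrated on White than the original, in line with the higher alert value.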

Length

[Figure: Histogram of lengths of the category]

Common Values (Plot)

[Figures: Original Data and Synthetic Data]

ValueCountFrequency (%)
white 11935
85.2%
black 1356
 
9.7%
asian-pac-islander 463
 
3.3%
amer-indian-eskimo 132
 
0.9%
other 114
 
0.8%
ValueCountFrequency (%)
white 12736
91.0%
black 738
 
5.3%
asian-pac-islander 330
 
2.4%
amer-indian-eskimo 113
 
0.8%
other 83
 
0.6%

Most occurring characters

ValueCountFrequency (%)
i 12662
16.3%
e 12644
16.3%
t 12049
15.5%
h 12049
15.5%
W 11935
15.4%
a 2877
 
3.7%
l 1819
 
2.3%
c 1819
 
2.3%
k 1488
 
1.9%
B 1356
 
1.7%
Other values (12) 7037
9.1%
ValueCountFrequency (%)
i 13292
17.5%
e 13262
17.5%
h 12819
16.9%
t 12819
16.9%
W 12736
16.8%
a 1841
 
2.4%
l 1068
 
1.4%
c 1068
 
1.4%
- 886
 
1.2%
n 886
 
1.2%
Other values (12) 5082
 
6.7%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 61355
78.9%
Uppercase Letter 15190
 
19.5%
Dash Punctuation 1190
 
1.5%
ValueCountFrequency (%)
Lowercase Letter 59987
79.2%
Uppercase Letter 14886
 
19.6%
Dash Punctuation 886
 
1.2%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
i 12662
20.6%
e 12644
20.6%
t 12049
19.6%
h 12049
19.6%
a 2877
 
4.7%
l 1819
 
3.0%
c 1819
 
3.0%
k 1488
 
2.4%
n 1190
 
1.9%
s 1058
 
1.7%
Other values (4) 1700
 
2.8%
ValueCountFrequency (%)
i 13292
22.2%
e 13262
22.1%
h 12819
21.4%
t 12819
21.4%
a 1841
 
3.1%
l 1068
 
1.8%
c 1068
 
1.8%
n 886
 
1.5%
k 851
 
1.4%
s 773
 
1.3%
Other values (4) 1308
 
2.2%
Uppercase Letter
ValueCountFrequency (%)
W 11935
78.6%
B 1356
 
8.9%
A 595
 
3.9%
I 595
 
3.9%
P 463
 
3.0%
E 132
 
0.9%
O 114
 
0.8%
ValueCountFrequency (%)
W 12736
85.6%
B 738
 
5.0%
A 443
 
3.0%
I 443
 
3.0%
P 330
 
2.2%
E 113
 
0.8%
O 83
 
0.6%
Dash Punctuation
ValueCountFrequency (%)
- 1190
100.0%
ValueCountFrequency (%)
- 886
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin 76545
98.5%
Common 1190
 
1.5%
ValueCountFrequency (%)
Latin 74873
98.8%
Common 886
 
1.2%

Most frequent character per script

Latin
ValueCountFrequency (%)
i 12662
16.5%
e 12644
16.5%
t 12049
15.7%
h 12049
15.7%
W 11935
15.6%
a 2877
 
3.8%
l 1819
 
2.4%
c 1819
 
2.4%
k 1488
 
1.9%
B 1356
 
1.8%
Other values (11) 5847
7.6%
ValueCountFrequency (%)
i 13292
17.8%
e 13262
17.7%
h 12819
17.1%
t 12819
17.1%
W 12736
17.0%
a 1841
 
2.5%
l 1068
 
1.4%
c 1068
 
1.4%
n 886
 
1.2%
k 851
 
1.1%
Other values (11) 4231
 
5.7%
Common
ValueCountFrequency (%)
- 1190
100.0%
ValueCountFrequency (%)
- 886
100.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII 77735
100.0%
ValueCountFrequency (%)
ASCII 75759
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
i 12662
16.3%
e 12644
16.3%
t 12049
15.5%
h 12049
15.5%
W 11935
15.4%
a 2877
 
3.7%
l 1819
 
2.3%
c 1819
 
2.3%
k 1488
 
1.9%
B 1356
 
1.7%
Other values (12) 7037
9.1%
ValueCountFrequency (%)
i 13292
17.5%
e 13262
17.5%
h 12819
16.9%
t 12819
16.9%
W 12736
16.8%
a 1841
 
2.4%
l 1068
 
1.4%
c 1068
 
1.4%
- 886
 
1.2%
n 886
 
1.2%
Other values (12) 5082
 
6.7%

gender
Categorical

              Original Data  Synthetic Data
Distinct      2              2
Distinct (%)  < 0.1%         < 0.1%
Missing       0              0
Missing (%)   0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Value   Original Data  Synthetic Data
Male    9372           9053
Female  4628           4947

Length

               Original Data  Synthetic Data
Max length     6              6
Median length  4              4
Mean length    4.6611429      4.7067143
Min length     4              4
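
The mean length can be checked against the category counts: it is the total number of characters divided by the number of rows. A small sketch using the gender counts reported in this section (the helper name mean_length is hypothetical):

```python
def mean_length(value_counts):
    """Return (total characters, mean string length) for a value -> count mapping."""
    total_chars = sum(len(value) * count for value, count in value_counts.items())
    total_rows = sum(value_counts.values())
    return total_chars, total_chars / total_rows


gender_original = {"Male": 9372, "Female": 4628}
gender_synthetic = {"Male": 9053, "Female": 4947}

chars_original, mean_original = mean_length(gender_original)
chars_synthetic, mean_synthetic = mean_length(gender_synthetic)
print(chars_original, round(mean_original, 7))    # 65256 4.6611429
print(chars_synthetic, round(mean_synthetic, 7))  # 65894 4.7067143
```

The slightly higher synthetic mean follows from the larger share of the six-character value Female.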

Characters and Unicode

                     Original Data  Synthetic Data
Total characters     65256          65894
Distinct characters  6              6
Distinct categories  2              2
Distinct scripts     1              1
Distinct blocks      1              1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

            Original Data  Synthetic Data
Unique      0              0
Unique (%)  0.0%           0.0%

Sample

         Original Data  Synthetic Data
1st row  Male           Male
2nd row  Male           Male
3rd row  Male           Male
4th row  Male           Male
5th row  Male           Male

Common Values

Original Data

Value   Count  Frequency (%)
Male    9372   66.9%
Female  4628   33.1%

Synthetic Data

Value   Count  Frequency (%)
Male    9053   64.7%
Female  4947   35.3%

Length

[Figure: Histogram of lengths of the category]

Common Values (Plot)

[Figures: Original Data and Synthetic Data]

ValueCountFrequency (%)
male 9372
66.9%
female 4628
33.1%
ValueCountFrequency (%)
male 9053
64.7%
female 4947
35.3%

Most occurring characters

ValueCountFrequency (%)
e 18628
28.5%
a 14000
21.5%
l 14000
21.5%
M 9372
14.4%
F 4628
 
7.1%
m 4628
 
7.1%
ValueCountFrequency (%)
e 18947
28.8%
a 14000
21.2%
l 14000
21.2%
M 9053
13.7%
F 4947
 
7.5%
m 4947
 
7.5%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 51256
78.5%
Uppercase Letter 14000
 
21.5%
ValueCountFrequency (%)
Lowercase Letter 51894
78.8%
Uppercase Letter 14000
 
21.2%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
e 18628
36.3%
a 14000
27.3%
l 14000
27.3%
m 4628
 
9.0%
ValueCountFrequency (%)
e 18947
36.5%
a 14000
27.0%
l 14000
27.0%
m 4947
 
9.5%
Uppercase Letter
ValueCountFrequency (%)
M 9372
66.9%
F 4628
33.1%
ValueCountFrequency (%)
M 9053
64.7%
F 4947
35.3%

Most occurring scripts

ValueCountFrequency (%)
Latin 65256
100.0%
ValueCountFrequency (%)
Latin 65894
100.0%

Most frequent character per script

Latin
ValueCountFrequency (%)
e 18628
28.5%
a 14000
21.5%
l 14000
21.5%
M 9372
14.4%
F 4628
 
7.1%
m 4628
 
7.1%
ValueCountFrequency (%)
e 18947
28.8%
a 14000
21.2%
l 14000
21.2%
M 9053
13.7%
F 4947
 
7.5%
m 4947
 
7.5%

Most occurring blocks

ValueCountFrequency (%)
ASCII 65256
100.0%
ValueCountFrequency (%)
ASCII 65894
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
e 18628
28.5%
a 14000
21.5%
l 14000
21.5%
M 9372
14.4%
F 4628
 
7.1%
m 4628
 
7.1%
ValueCountFrequency (%)
e 18947
28.8%
a 14000
21.2%
l 14000
21.2%
M 9053
13.7%
F 4947
 
7.5%
m 4947
 
7.5%

capital_gain
Categorical

              Original Data  Synthetic Data
Distinct      109            108
Distinct (%)  0.8%           0.8%
Missing       0              2
Missing (%)   0.0%           < 0.1%
Memory size   109.5 KiB      109.5 KiB

Original Data

Value               Count
0                   12811
15024               161
7688                127
7298                108
99999               76
Other values (104)  717

Synthetic Data

Value               Count
0                   13079
7298                231
15024               137
7688                112
5178                41
Other values (103)  398

Length

Max length     6
Median length  1
Mean length    1.210673
Min length     1

Characters and Unicode

Total characters     16947
Distinct characters  11
Distinct categories  2
Distinct scripts     1
Distinct blocks      1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

            Original Data  Synthetic Data
Unique      19             51
Unique (%)  0.1%           0.4%

Sample

1st row  0
2nd row  0
3rd row  0
4th row  0
5th row  0

Common Values

Original Data

Value              Count  Frequency (%)
0                  12811  91.5%
15024              161    1.1%
7688               127    0.9%
7298               108    0.8%
99999              76     0.5%
3103               46     0.3%
5178               43     0.3%
5013               33     0.2%
4386               29     0.2%
2174               27     0.2%
Other values (99)  539    3.9%

Synthetic Data

Value              Count  Frequency (%)
0                  13079  93.4%
7298               231    1.7%
15024              137    1.0%
7688               112    0.8%
5178               41     0.3%
4650               33     0.2%
99999              29     0.2%
4386               28     0.2%
5013               24     0.2%
3103               17     0.1%
Other values (98)  267    1.9%

Length

[Figure: Histogram of lengths of the category]

Common Values (Plot)

Original Data: Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

Synthetic Data: Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

ValueCountFrequency (%)
0 13080
93.4%
7298 231
 
1.7%
15024 137
 
1.0%
7688 112
 
0.8%
5178 41
 
0.3%
4650 33
 
0.2%
99999 29
 
0.2%
4386 28
 
0.2%
5013 24
 
0.2%
3103 17
 
0.1%
Other values (98) 267
 
1.9%

Most occurring characters

ValueCountFrequency (%)
0 13397
79.1%
8 634
 
3.7%
7 474
 
2.8%
2 474
 
2.8%
9 429
 
2.5%
1 404
 
2.4%
4 368
 
2.2%
5 322
 
1.9%
6 260
 
1.5%
3 184
 
1.1%

Most occurring categories

ValueCountFrequency (%)
Decimal Number 16946
> 99.9%
Space Separator 1
 
< 0.1%

Most frequent character per category

Decimal Number
ValueCountFrequency (%)
0 13397
79.1%
8 634
 
3.7%
7 474
 
2.8%
2 474
 
2.8%
9 429
 
2.5%
1 404
 
2.4%
4 368
 
2.2%
5 322
 
1.9%
6 260
 
1.5%
3 184
 
1.1%
Space Separator
ValueCountFrequency (%)
1
100.0%

Most occurring scripts

ValueCountFrequency (%)
Common 16947
100.0%

Most frequent character per script

Common
ValueCountFrequency (%)
0 13397
79.1%
8 634
 
3.7%
7 474
 
2.8%
2 474
 
2.8%
9 429
 
2.5%
1 404
 
2.4%
4 368
 
2.2%
5 322
 
1.9%
6 260
 
1.5%
3 184
 
1.1%

Most occurring blocks

ValueCountFrequency (%)
ASCII 16947
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
0 13397
79.1%
8 634
 
3.7%
7 474
 
2.8%
2 474
 
2.8%
9 429
 
2.5%
1 404
 
2.4%
4 368
 
2.2%
5 322
 
1.9%
6 260
 
1.5%
3 184
 
1.1%

capital_loss
Real number (ℝ)

              Original Data  Synthetic Data
Distinct      76             55
Distinct (%)  0.5%           0.4%
Missing       0              0
Missing (%)   0.0%           0.0%
Infinite      0              0
Infinite (%)  0.0%           0.0%
Mean          84.930214      46.901357
Minimum       0              0
Maximum       3900           25485
Zeros         13354          13659
Zeros (%)     95.4%          97.6%
Negative      0              0
Negative (%)  0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Quantile statistics

                           Original Data  Synthetic Data
Minimum                    0              0
5-th percentile            0              0
Q1                         0              0
median                     0              0
Q3                         0              0
95-th percentile           0              0
Maximum                    3900           25485
Range                      3900           25485
Interquartile range (IQR)  0              0

Descriptive statistics

                                 Original Data  Synthetic Data
Standard deviation               394.66496      387.47855
Coefficient of variation (CV)    4.6469324      8.2615637
Kurtosis                         20.340582      1676.9944
Mean                             84.930214      46.901357
Median Absolute Deviation (MAD)  0              0
Skewness                         4.615084       30.0347
Sum                              1189023        656619
Variance                         155760.43      150139.63
Monotonicity                     Not monotonic  Not monotonic
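
Several of these statistics are tied together by simple identities: the coefficient of variation is the standard deviation divided by the mean, the variance is the squared standard deviation, and the mean is the sum divided by the 14000 observations. A short sketch checking the reported capital_loss values against those identities (the dictionary layout is only for illustration):

```python
import math

N_ROWS = 14000  # number of observations reported in the Overview

reported = {
    "Original Data": {"mean": 84.930214, "std": 394.66496, "cv": 4.6469324,
                      "variance": 155760.43, "sum": 1189023},
    "Synthetic Data": {"mean": 46.901357, "std": 387.47855, "cv": 8.2615637,
                       "variance": 150139.63, "sum": 656619},
}

checks = {}
for name, s in reported.items():
    checks[name] = {
        "cv": s["std"] / s["mean"],        # coefficient of variation
        "std": math.sqrt(s["variance"]),   # standard deviation from variance
        "mean": s["sum"] / N_ROWS,         # mean from sum
    }
    print(name, {k: round(v, 4) for k, v in checks[name].items()})
```

The two standard deviations are close, so the near doubling of the CV in the Synthetic Data comes mostly from its lower mean.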
[Figure: Histogram with fixed size bins (bins=50)]

Original Data

Value              Count  Frequency (%)
0                  13354  95.4%
1902               82     0.6%
1887               66     0.5%
1977               61     0.4%
1485               26     0.2%
1590               23     0.2%
1740               22     0.2%
2415               22     0.2%
1602               21     0.1%
1876               20     0.1%
Other values (66)  303    2.2%

Synthetic Data

Value              Count  Frequency (%)
0                  13659  97.6%
1902               64     0.5%
1977               57     0.4%
1485               24     0.2%
1672               23     0.2%
1887               22     0.2%
1876               17     0.1%
2377               11     0.1%
1590               11     0.1%
2002               9      0.1%
Other values (45)  103    0.7%

Original Data

Value  Count  Frequency (%)
0      13354  95.4%
213    1      < 0.1%
323    2      < 0.1%
419    2      < 0.1%
625    9      0.1%
653    1      < 0.1%
810    2      < 0.1%
880    3      < 0.1%
974    1      < 0.1%
1092   4      < 0.1%

Synthetic Data

Value  Count  Frequency (%)
0      13659  97.6%
2      1      < 0.1%
176    1      < 0.1%
180    1      < 0.1%
200    1      < 0.1%
204    1      < 0.1%
625    4      < 0.1%
810    2      < 0.1%
880    1      < 0.1%
1051   1      < 0.1%

hours_per_week
Real number (ℝ)

              Original Data  Synthetic Data
Distinct      90             82
Distinct (%)  0.6%           0.6%
Missing       0              0
Missing (%)   0.0%           0.0%
Infinite      0              0
Infinite (%)  0.0%           0.0%
Mean          40.240929      39.731357
Minimum       1              1
Maximum       99             762
Zeros         0              0
Zeros (%)     0.0%           0.0%
Negative      0              0
Negative (%)  0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Quantile statistics

                           Original Data  Synthetic Data
Minimum                    1              1
5-th percentile            16             16
Q1                         40             40
median                     40             40
Q3                         45             43
95-th percentile           60             60
Maximum                    99             762
Range                      98             761
Interquartile range (IQR)  5              3
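
The synthetic maximum of 762 lies far outside the original range of 1 to 99. A minimal sketch of a range check that would surface such values, assuming the original minimum and maximum are taken as bounds. The sample list is hypothetical apart from 762, which is the reported synthetic maximum, and the helper name out_of_range is an illustration only.

```python
def out_of_range(values, lower, upper):
    """Return the values that fall outside the closed interval [lower, upper]."""
    return [v for v in values if not (lower <= v <= upper)]


# Bounds taken from the Original Data column of the quantile table.
original_min, original_max = 1, 99

# Hypothetical synthetic rows; only 762 is taken from the report.
synthetic_sample = [40, 50, 762, 1, 45]

flagged = out_of_range(synthetic_sample, original_min, original_max)
print(flagged)  # [762]
```

Whether such rows should be clipped, dropped or regenerated depends on how the synthetic data will be used.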

Descriptive statistics

                                 Original Data  Synthetic Data
Standard deviation               12.368062      15.4285
Coefficient of variation (CV)    0.30735031     0.38832049
Kurtosis                         2.9757174      767.0054
Mean                             40.240929      39.731357
Median Absolute Deviation (MAD)  4              1.5
Skewness                         0.23608827     17.54265
Sum                              563373         556239
Variance                         152.96895      238.03862
Monotonicity                     Not monotonic  Not monotonic
[Figure: Histogram with fixed size bins (bins=50)]

Original Data

Value              Count  Frequency (%)
40                 6534   46.7%
50                 1173   8.4%
45                 794    5.7%
60                 615    4.4%
35                 563    4.0%
20                 522    3.7%
30                 519    3.7%
25                 309    2.2%
55                 283    2.0%
48                 212    1.5%
Other values (80)  2476   17.7%

Synthetic Data

Value              Count  Frequency (%)
40                 6976   49.8%
50                 1136   8.1%
45                 683    4.9%
20                 603    4.3%
30                 588    4.2%
60                 525    3.8%
35                 454    3.2%
25                 285    2.0%
55                 219    1.6%
15                 213    1.5%
Other values (72)  2318   16.6%

Original Data

Value  Count  Frequency (%)
1      10     0.1%
2      16     0.1%
3      21     0.1%
4      22     0.2%
5      24     0.2%
6      28     0.2%
7      11     0.1%
8      62     0.4%
9      5      < 0.1%
10     124    0.9%

Synthetic Data

Value  Count  Frequency (%)
1      10     0.1%
2      12     0.1%
3      13     0.1%
4      3      < 0.1%
5      6      < 0.1%
6      37     0.3%
7      3      < 0.1%
8      75     0.5%
10     184    1.3%
11     5      < 0.1%

native_country
Categorical

              Original Data  Synthetic Data
Distinct      42             48
Distinct (%)  0.3%           0.3%
Missing       0              0
Missing (%)   0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Original Data

Value              Count
United-States      12532
Mexico             274
?                  269
Philippines        86
Germany            71
Other values (37)  768

Synthetic Data

Value              Count
United-States      12684
Mexico             304
?                  199
El-Salvador        85
Germany            68
Other values (43)  660

Length

               Original Data  Synthetic Data
Max length     26             26
Median length  13             13
Mean length    12.276857      12.372929
Min length     1              1

Characters and Unicode

                     Original Data  Synthetic Data
Total characters     171876         173221
Distinct characters  45             45
Distinct categories  6              6
Distinct scripts     2              2
Distinct blocks      1              1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

            Original Data  Synthetic Data
Unique      1              9
Unique (%)  < 0.1%         0.1%

Sample

         Original Data  Synthetic Data
1st row  United-States  Taiwan
2nd row  United-States  Japan
3rd row  United-States  South
4th row  United-States  Italy
5th row  United-States  Mexico

Common Values

Original Data

Value              Count  Frequency (%)
United-States      12532  89.5%
Mexico             274    2.0%
?                  269    1.9%
Philippines        86     0.6%
Germany            71     0.5%
Canada             56     0.4%
El-Salvador        48     0.3%
India              44     0.3%
Puerto-Rico        42     0.3%
England            40     0.3%
Other values (32)  538    3.8%

Synthetic Data

Value              Count  Frequency (%)
United-States      12684  90.6%
Mexico             304    2.2%
?                  199    1.4%
El-Salvador        85     0.6%
Germany            68     0.5%
Italy              65     0.5%
China              64     0.5%
Canada             61     0.4%
Columbia           45     0.3%
Vietnam            41     0.3%
Other values (38)  384    2.7%

Length

[Figure: Histogram of lengths of the category]

Common Values (Plot)

Original Data: Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

Synthetic Data: Number of variable categories passes threshold (config.plot.cat_freq.max_unique)

ValueCountFrequency (%)
united-states 12532
89.5%
mexico 274
 
2.0%
269
 
1.9%
philippines 86
 
0.6%
germany 71
 
0.5%
canada 56
 
0.4%
el-salvador 48
 
0.3%
india 44
 
0.3%
puerto-rico 42
 
0.3%
england 40
 
0.3%
Other values (32) 538
 
3.8%
ValueCountFrequency (%)
united-states 12684
90.6%
mexico 304
 
2.2%
199
 
1.4%
el-salvador 85
 
0.6%
germany 68
 
0.5%
italy 65
 
0.5%
china 64
 
0.5%
canada 61
 
0.4%
columbia 45
 
0.3%
vietnam 41
 
0.3%
Other values (38) 384
 
2.7%

Most occurring characters

ValueCountFrequency (%)
t 37803
22.0%
e 25720
15.0%
a 13639
 
7.9%
i 13446
 
7.8%
n 13141
 
7.6%
d 12814
 
7.5%
- 12663
 
7.4%
s 12641
 
7.4%
S 12626
 
7.3%
U 12540
 
7.3%
Other values (35) 4843
 
2.8%
ValueCountFrequency (%)
t 38309
22.1%
e 25968
15.0%
a 13777
 
8.0%
i 13484
 
7.8%
n 13181
 
7.6%
d 12919
 
7.5%
- 12843
 
7.4%
S 12811
 
7.4%
s 12749
 
7.4%
U 12694
 
7.3%
Other values (35) 4486
 
2.6%

Most occurring categories

ValueCountFrequency (%)
Lowercase Letter 132510
77.1%
Uppercase Letter 26418
 
15.4%
Dash Punctuation 12663
 
7.4%
Other Punctuation 277
 
0.2%
Open Punctuation 4
 
< 0.1%
Close Punctuation 4
 
< 0.1%
ValueCountFrequency (%)
Lowercase Letter 133526
77.1%
Uppercase Letter 26649
 
15.4%
Dash Punctuation 12843
 
7.4%
Other Punctuation 201
 
0.1%
Open Punctuation 1
 
< 0.1%
Close Punctuation 1
 
< 0.1%

Most frequent character per category

Lowercase Letter
ValueCountFrequency (%)
t 37803
28.5%
e 25720
19.4%
a 13639
 
10.3%
i 13446
 
10.1%
n 13141
 
9.9%
d 12814
 
9.7%
s 12641
 
9.5%
o 606
 
0.5%
c 469
 
0.4%
l 407
 
0.3%
Other values (11) 1824
 
1.4%
ValueCountFrequency (%)
t 38309
28.7%
e 25968
19.4%
a 13777
 
10.3%
i 13484
 
10.1%
n 13181
 
9.9%
d 12919
 
9.7%
s 12749
 
9.5%
o 594
 
0.4%
l 448
 
0.3%
c 447
 
0.3%
Other values (11) 1650
 
1.2%
Dash Punctuation
ValueCountFrequency (%)
- 12663
100.0%
ValueCountFrequency (%)
- 12843
100.0%
Uppercase Letter
ValueCountFrequency (%)
S 12626
47.8%
U 12540
47.5%
M 274
 
1.0%
P 182
 
0.7%
C 156
 
0.6%
G 121
 
0.5%
I 108
 
0.4%
E 101
 
0.4%
R 70
 
0.3%
J 57
 
0.2%
Other values (9) 183
 
0.7%
ValueCountFrequency (%)
S 12811
48.1%
U 12694
47.6%
M 304
 
1.1%
C 195
 
0.7%
E 118
 
0.4%
G 102
 
0.4%
I 82
 
0.3%
P 80
 
0.3%
R 56
 
0.2%
V 42
 
0.2%
Other values (9) 165
 
0.6%
Other Punctuation
ValueCountFrequency (%)
? 269
97.1%
& 8
 
2.9%
ValueCountFrequency (%)
? 199
99.0%
& 2
 
1.0%
Open Punctuation
ValueCountFrequency (%)
( 4
100.0%
ValueCountFrequency (%)
( 1
100.0%
Close Punctuation
ValueCountFrequency (%)
) 4
100.0%
ValueCountFrequency (%)
) 1
100.0%

Most occurring scripts

ValueCountFrequency (%)
Latin 158928
92.5%
Common 12948
 
7.5%
ValueCountFrequency (%)
Latin 160175
92.5%
Common 13046
 
7.5%

Most frequent character per script

Latin
ValueCountFrequency (%)
t 37803
23.8%
e 25720
16.2%
a 13639
 
8.6%
i 13446
 
8.5%
n 13141
 
8.3%
d 12814
 
8.1%
s 12641
 
8.0%
S 12626
 
7.9%
U 12540
 
7.9%
o 606
 
0.4%
Other values (30) 3952
 
2.5%
ValueCountFrequency (%)
t 38309
23.9%
e 25968
16.2%
a 13777
 
8.6%
i 13484
 
8.4%
n 13181
 
8.2%
d 12919
 
8.1%
S 12811
 
8.0%
s 12749
 
8.0%
U 12694
 
7.9%
o 594
 
0.4%
Other values (30) 3689
 
2.3%
Common
ValueCountFrequency (%)
- 12663
97.8%
? 269
 
2.1%
& 8
 
0.1%
( 4
 
< 0.1%
) 4
 
< 0.1%
ValueCountFrequency (%)
- 12843
98.4%
? 199
 
1.5%
& 2
 
< 0.1%
( 1
 
< 0.1%
) 1
 
< 0.1%

Most occurring blocks

ValueCountFrequency (%)
ASCII 171876
100.0%
ValueCountFrequency (%)
ASCII 173221
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
t 37803
22.0%
e 25720
15.0%
a 13639
 
7.9%
i 13446
 
7.8%
n 13141
 
7.6%
d 12814
 
7.5%
- 12663
 
7.4%
s 12641
 
7.4%
S 12626
 
7.3%
U 12540
 
7.3%
Other values (35) 4843
 
2.8%
ValueCountFrequency (%)
t 38309
22.1%
e 25968
15.0%
a 13777
 
8.0%
i 13484
 
7.8%
n 13181
 
7.6%
d 12919
 
7.5%
- 12843
 
7.4%
S 12811
 
7.4%
s 12749
 
7.4%
U 12694
 
7.3%
Other values (35) 4486
 
2.6%

income_bracket
Categorical

              Original Data  Synthetic Data
Distinct      2              2
Distinct (%)  < 0.1%         < 0.1%
Missing       0              0
Missing (%)   0.0%           0.0%
Memory size   109.5 KiB      109.5 KiB

Value  Original Data  Synthetic Data
<=50K  10660          11477
>50K   3340           2523

Length

               Original Data  Synthetic Data
Max length     5              5
Median length  5              5
Mean length    4.7614286      4.8197857
Min length     4              4

Characters and Unicode

                     Original Data  Synthetic Data
Total characters     66660          67477
Distinct characters  6              6
Distinct categories  3              3
Distinct scripts     2              2
Distinct blocks      1              1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

            Original Data  Synthetic Data
Unique      0              0
Unique (%)  0.0%           0.0%

Sample

         Original Data  Synthetic Data
1st row  <=50K          >50K
2nd row  >50K           <=50K
3rd row  <=50K          <=50K
4th row  <=50K          <=50K
5th row  <=50K          <=50K

Common Values

Original Data

Value  Count  Frequency (%)
<=50K  10660  76.1%
>50K   3340   23.9%

Synthetic Data

Value  Count  Frequency (%)
<=50K  11477  82.0%
>50K   2523   18.0%
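
One simple way to condense the shift between the two Common Values tables into a single number is the total variation distance, half the sum of absolute differences between category shares. This metric is not part of the report; it is an assumption chosen for illustration, and the function name is hypothetical.

```python
def total_variation_distance(counts_a, counts_b):
    """Half the L1 distance between two categorical distributions given as counts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    categories = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in categories
    )


income_original = {"<=50K": 10660, ">50K": 3340}
income_synthetic = {"<=50K": 11477, ">50K": 2523}

tvd = total_variation_distance(income_original, income_synthetic)
print(f"{tvd:.4f}")  # 0.0584
```

A value of about 0.058 means that roughly 5.8% of the rows would have to change bracket for the two distributions to match.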

Length

[Figure: Histogram of lengths of the category]

Common Values (Plot)

[Figures: Original Data and Synthetic Data]

ValueCountFrequency (%)
50k 14000
100.0%
ValueCountFrequency (%)
50k 14000
100.0%

Most occurring characters

ValueCountFrequency (%)
5 14000
21.0%
0 14000
21.0%
K 14000
21.0%
< 10660
16.0%
= 10660
16.0%
> 3340
 
5.0%
ValueCountFrequency (%)
5 14000
20.7%
0 14000
20.7%
K 14000
20.7%
< 11477
17.0%
= 11477
17.0%
> 2523
 
3.7%

Most occurring categories

ValueCountFrequency (%)
Decimal Number 28000
42.0%
Math Symbol 24660
37.0%
Uppercase Letter 14000
21.0%
ValueCountFrequency (%)
Decimal Number 28000
41.5%
Math Symbol 25477
37.8%
Uppercase Letter 14000
20.7%

Most frequent character per category

Decimal Number
ValueCountFrequency (%)
5 14000
50.0%
0 14000
50.0%
ValueCountFrequency (%)
5 14000
50.0%
0 14000
50.0%
Uppercase Letter
ValueCountFrequency (%)
K 14000
100.0%
ValueCountFrequency (%)
K 14000
100.0%
Math Symbol
ValueCountFrequency (%)
< 10660
43.2%
= 10660
43.2%
> 3340
 
13.5%
ValueCountFrequency (%)
< 11477
45.0%
= 11477
45.0%
> 2523
 
9.9%

Most occurring scripts

ValueCountFrequency (%)
Common 52660
79.0%
Latin 14000
 
21.0%
ValueCountFrequency (%)
Common 53477
79.3%
Latin 14000
 
20.7%

Most frequent character per script

Common
ValueCountFrequency (%)
5 14000
26.6%
0 14000
26.6%
< 10660
20.2%
= 10660
20.2%
> 3340
 
6.3%
ValueCountFrequency (%)
5 14000
26.2%
0 14000
26.2%
< 11477
21.5%
= 11477
21.5%
> 2523
 
4.7%
Latin
ValueCountFrequency (%)
K 14000
100.0%
ValueCountFrequency (%)
K 14000
100.0%

Most occurring blocks

ValueCountFrequency (%)
ASCII 66660
100.0%
ValueCountFrequency (%)
ASCII 67477
100.0%

Most frequent character per block

ASCII
ValueCountFrequency (%)
5 14000
21.0%
0 14000
21.0%
K 14000
21.0%
< 10660
16.0%
= 10660
16.0%
> 3340
 
5.0%
ValueCountFrequency (%)
5 14000
20.7%
0 14000
20.7%
K 14000
20.7%
< 11477
17.0%
= 11477
17.0%
> 2523
 
3.7%

Interactions

[Figures: 36 interaction plots, each shown as an Original Data panel and a Synthetic Data panel. For 11 of the 36 plots the Synthetic Data panel reads "Interaction plot not present for dataset".]

Correlations

[Figures: correlation plots for Original Data and Synthetic Data]

Original Data

                age     fnlwgt  education_num  capital_gain  capital_loss  hours_per_week  workclass  education  marital_status  occupation  relationship  race    gender  native_country  income_bracket
age             1.000   -0.081  0.061          0.132         0.052         0.152           0.124      0.119      0.281           0.123       0.279         0.022   0.128   0.032           0.321
fnlwgt          -0.081  1.000   -0.037         -0.016        -0.008        -0.033          0.018      0.019      0.016           0.010       0.021         0.072   0.041   0.045           0.001
education_num   0.061   -0.037  1.000          0.126         0.075         0.167           0.100      1.000      0.083           0.224       0.111         0.062   0.073   0.142           0.365
capital_gain    0.132   -0.016  0.126          1.000         -0.067        0.092           0.043      0.159      0.040           0.070       0.049         0.000   0.058   0.000           0.269
capital_loss    0.052   -0.008  0.075          -0.067        1.000         0.051           0.021      0.045      0.036           0.030       0.051         0.000   0.053   0.000           0.141
hours_per_week  0.152   -0.033  0.167          0.092         0.051         1.000           0.123      0.089      0.117           0.143       0.161         0.054   0.238   0.022           0.267
workclass       0.124   0.018   0.100          0.043         0.021         0.123           1.000      0.109      0.083           0.427       0.097         0.053   0.140   0.028           0.170
education       0.119   0.019   1.000          0.159         0.045         0.089           0.109      1.000      0.095           0.187       0.122         0.070   0.091   0.134           0.371
marital_status  0.281   0.016   0.083          0.040         0.036         0.117           0.083      0.095      1.000           0.132       0.489         0.083   0.452   0.063           0.449
occupation      0.123   0.010   0.224          0.070         0.030         0.143           0.427      0.187      0.132           1.000       0.179         0.076   0.423   0.062           0.350
relationship    0.279   0.021   0.111          0.049         0.051         0.161           0.097      0.122      0.489           0.179       1.000         0.099   0.648   0.079           0.455
race            0.022   0.072   0.062          0.000         0.000         0.054           0.053      0.070      0.083           0.076       0.099         1.000   0.107   0.389           0.102
gender          0.128   0.041   0.073          0.058         0.053         0.238           0.140      0.091      0.452           0.423       0.648         0.107   1.000   0.042           0.213
native_country  0.032   0.045   0.142          0.000         0.000         0.022           0.028      0.134      0.063           0.062       0.079         0.389   0.042   1.000           0.087
income_bracket  0.321   0.001   0.365          0.269         0.141         0.267           0.170      0.371      0.449           0.350       0.455         0.102   0.213   0.087           1.000

Synthetic Data

variable | age | fnlwgt | education_num | capital_loss | hours_per_week | workclass | education | marital_status | occupation | relationship | race | gender | native_country | income_bracket
age | 1.000 | -0.047 | 0.031 | -0.009 | 0.004 | 0.122 | 0.119 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | 0.000 | 0.005
fnlwgt | -0.047 | 1.000 | -0.028 | 0.004 | -0.007 | 0.006 | 0.000 | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
education_num | 0.031 | -0.028 | 1.000 | -0.031 | -0.006 | 0.086 | 1.000 | 0.000 | 0.019 | 0.000 | 0.000 | 0.000 | 0.052 | 0.080
capital_loss | -0.009 | 0.004 | -0.031 | 1.000 | 0.058 | 0.016 | 0.036 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.042 | 0.000
hours_per_week | 0.004 | -0.007 | -0.006 | 0.058 | 1.000 | 0.026 | 0.000 | 0.000 | 0.000 | 0.044 | 0.000 | 0.022 | 0.000 | 0.020
workclass | 0.122 | 0.006 | 0.086 | 0.016 | 0.026 | 1.000 | 0.095 | 0.000 | 0.018 | 0.015 | 0.014 | 0.000 | 0.025 | 0.076
education | 0.119 | 0.000 | 1.000 | 0.036 | 0.000 | 0.095 | 1.000 | 0.008 | 0.013 | 0.012 | 0.015 | 0.000 | 0.048 | 0.116
marital_status | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 1.000 | 0.111 | 0.412 | 0.032 | 0.415 | 0.000 | 0.139
occupation | 0.000 | 0.015 | 0.019 | 0.000 | 0.000 | 0.018 | 0.013 | 0.111 | 1.000 | 0.159 | 0.013 | 0.232 | 0.017 | 0.091
relationship | 0.007 | 0.000 | 0.000 | 0.000 | 0.044 | 0.015 | 0.012 | 0.412 | 0.159 | 1.000 | 0.081 | 0.637 | 0.000 | 0.157
race | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.014 | 0.015 | 0.032 | 0.013 | 0.081 | 1.000 | 0.085 | 0.021 | 0.000
gender | 0.000 | 0.000 | 0.000 | 0.000 | 0.022 | 0.000 | 0.000 | 0.415 | 0.232 | 0.637 | 0.085 | 1.000 | 0.035 | 0.061
native_country | 0.000 | 0.000 | 0.052 | 0.042 | 0.000 | 0.025 | 0.048 | 0.000 | 0.017 | 0.000 | 0.021 | 0.035 | 1.000 | 0.086
income_bracket | 0.005 | 0.000 | 0.080 | 0.000 | 0.020 | 0.076 | 0.116 | 0.139 | 0.091 | 0.157 | 0.000 | 0.061 | 0.086 | 1.000
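
One way to read the two correlation tables side by side is to subtract them and rank the variable pairs by how far the synthetic value drifts from the original. The sketch below does this for four variables whose values are copied from the tables above; the helper name `correlation_gaps` is hypothetical, and the report itself does not state that it computes such a difference.

```python
import numpy as np
import pandas as pd

cols = ["marital_status", "relationship", "gender", "income_bracket"]

# Values copied from the Original Data correlation table (subset of variables).
original = pd.DataFrame(
    [[1.000, 0.489, 0.452, 0.449],
     [0.489, 1.000, 0.648, 0.455],
     [0.452, 0.648, 1.000, 0.213],
     [0.449, 0.455, 0.213, 1.000]],
    index=cols, columns=cols,
)

# Values copied from the Synthetic Data correlation table (same subset).
synthetic = pd.DataFrame(
    [[1.000, 0.412, 0.415, 0.139],
     [0.412, 1.000, 0.637, 0.157],
     [0.415, 0.637, 1.000, 0.061],
     [0.139, 0.157, 0.061, 1.000]],
    index=cols, columns=cols,
)


def correlation_gaps(a: pd.DataFrame, b: pd.DataFrame) -> pd.Series:
    """Absolute difference per variable pair, largest first (hypothetical helper)."""
    # Only variables present in both matrices can be compared
    # (capital_gain, for example, is missing from the synthetic table).
    shared = [c for c in a.columns if c in b.columns]
    diff = (a.loc[shared, shared] - b.loc[shared, shared]).abs()
    # Keep the strict upper triangle so each pair is listed once.
    upper = np.triu(np.ones(diff.shape, dtype=bool), k=1)
    return diff.where(upper).stack().dropna().sort_values(ascending=False)


gaps = correlation_gaps(original, synthetic)
print(gaps.round(3))
```

For this subset, the pairs involving income_bracket show the largest gaps, while relationship and gender stay close in both datasets.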

Missing values

Original Data

[Missing values bar chart]
A simple visualization of nullity by column.

Synthetic Data

[Missing values bar chart]
A simple visualization of nullity by column.

Original Data

[Nullity matrix]
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Synthetic Data

[Nullity matrix]
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

[Nullity correlation heatmap]
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
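
As a rough illustration of the nullity correlation described above, the sketch below correlates the missing/present indicators of each column with pandas. It assumes that columns which are never (or always) missing are dropped first, since their indicator has no variance; the function name `nullity_correlation` and the toy data are hypothetical and not taken from the report.

```python
import pandas as pd


def nullity_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate per-column missingness indicators (hypothetical helper)."""
    nulls = df.isnull()
    # A column that is never or always missing has a constant indicator,
    # so its correlation is undefined; keep only columns that vary.
    varying = nulls.columns[nulls.nunique() > 1]
    return nulls[varying].astype(int).corr()


# Toy data: "a" and "c" are missing together, "b" is missing when "a" is present,
# and "d" is complete.
toy = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, None],
        "b": [None, 2.0, None, 4.0],
        "c": [1.0, None, 3.0, None],
        "d": [1.0, 2.0, 3.0, 4.0],
    }
)

corr = nullity_correlation(toy)
print(corr)
```

A value near 1 means two columns tend to be missing in the same rows, and a value near -1 means one tends to be present when the other is missing.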

Sample

First rows

Original Data

index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket
0 | 30 | ? | 157289 | 11th | 7 | Never-married | ? | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K
1 | 33 | Private | 170769 | Doctorate | 16 | Divorced | Sales | Not-in-family | White | Male | 99999 | 0 | 60 | United-States | >50K
2 | 37 | Private | 279029 | Bachelors | 13 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K
3 | 30 | Private | 255004 | Assoc-acdm | 12 | Divorced | Sales | Not-in-family | White | Male | 0 | 0 | 52 | United-States | <=50K
4 | 24 | ? | 144898 | Some-college | 10 | Never-married | ? | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K
5 | 59 | Private | 61885 | 12th | 8 | Divorced | Transport-moving | Other-relative | Black | Male | 0 | 0 | 35 | United-States | <=50K
6 | 53 | Private | 96062 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | Greece | <=50K
7 | 18 | Private | 208103 | 11th | 7 | Never-married | Other-service | Other-relative | White | Male | 0 | 0 | 25 | United-States | <=50K
8 | 30 | Private | 190823 | Some-college | 10 | Never-married | Other-service | Own-child | Black | Female | 0 | 0 | 40 | United-States | <=50K
9 | 22 | ? | 424494 | Some-college | 10 | Never-married | ? | Own-child | White | Male | 0 | 0 | 25 | United-States | <=50K

Synthetic Data

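index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket
0 | 18 | ? | 97318.0 | 11th | 7 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 15 | Taiwan | >50K
1 | 24 | Private | 227070.0 | 10th | 6 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 40 | Japan | <=50K
2 | 51 | Private | 124187.0 | 9th | 5 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 30 | South | <=50K
3 | 28 | Private | 128509.0 | 5th-6th | 3 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 2444 | 40 | Italy | <=50K
4 | 48 | Private | 33155.0 | HS-grad | 9 | Never-married | ? | Own-child | White | Male | 0 | 0 | 38 | Mexico | <=50K
5 | 61 | Private | 119986.0 | Masters | 14 | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 30 | Philippines | >50K
6 | 43 | Private | 456236.0 | Masters | 14 | Married-civ-spouse | Other-service | Husband | White | Male | 0 | 0 | 60 | United-States | >50K
7 | 56 | Private | 104945.0 | 7th-8th | 4 | Divorced | Prof-specialty | Unmarried | Other | Female | 0 | 0 | 60 | United-States | >50K
8 | 23 | Private | 143582.0 | HS-grad | 9 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | >50K
9 | 34 | Private | 405284.0 | Bachelors | 13 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | Philippines | <=50K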

Last rows

Original Data

index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket
13990 | 40 | Self-emp-not-inc | 98985 | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K
13991 | 53 | Private | 68684 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K
13992 | 17 | Private | 365613 | 10th | 6 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 10 | Canada | <=50K
13993 | 50 | Private | 155594 | Assoc-acdm | 12 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 50 | United-States | >50K
13994 | 27 | Private | 69757 | Some-college | 10 | Never-married | Adm-clerical | Not-in-family | White | Female | 0 | 0 | 60 | United-States | <=50K
13995 | 39 | Local-gov | 178100 | Masters | 14 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K
13996 | 40 | Private | 226608 | Some-college | 10 | Divorced | Tech-support | Not-in-family | White | Male | 0 | 0 | 30 | Guatemala | >50K
13997 | 37 | Private | 295127 | HS-grad | 9 | Never-married | Machine-op-inspct | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K
13998 | 46 | Private | 149640 | 7th-8th | 4 | Married-spouse-absent | Transport-moving | Not-in-family | White | Male | 0 | 0 | 45 | United-States | <=50K
13999 | 57 | Private | 109015 | Some-college | 10 | Divorced | Sales | Not-in-family | White | Female | 0 | 0 | 48 | United-States | <=50K

Synthetic Data

index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket
13990 | 42 | Private | 194772.0 | Prof-school | 15 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 30 | United-States | <=50K
13991 | 28 | State-gov | 132551.0 | Bachelors | 13 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K
13992 | 53 | Local-gov | 34173.0 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 8 | United-States | <=50K
13993 | 38 | Local-gov | 209103.0 | Bachelors | 13 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 10 | United-States | <=50K
13994 | 46 | Private | 48885.0 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K
13995 | 44 | Private | 193882.0 | HS-grad | 9 | Divorced | ? | Unmarried | White | Female | 2202 | 0 | 40 | United-States | <=50K
13996 | 40 | Private | 99185.0 | Assoc-voc | 11 | Divorced | Sales | Unmarried | White | Female | 0 | 0 | 40 | ? | <=50K
13997 | 41 | Private | 121130.0 | HS-grad | 9 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 44 | Canada | <=50K
13998 | 47 | Private | 201734.0 | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K
13999 | 34 | Private | 34848.0 | Some-college | 10 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K

Duplicate rows

Original Data

index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket | # duplicates
2 | 25 | Private | 195994 | 1st-4th | 2 | Never-married | Priv-house-serv | Not-in-family | White | Female | 0 | 0 | 40 | Guatemala | <=50K | 3
0 | 21 | Private | 250051 | Some-college | 10 | Never-married | Prof-specialty | Own-child | White | Female | 0 | 0 | 10 | United-States | <=50K | 2
1 | 23 | Private | 240137 | 5th-6th | 3 | Never-married | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 55 | Mexico | <=50K | 2
3 | 25 | Private | 308144 | Bachelors | 13 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | Mexico | <=50K | 2
4 | 27 | Private | 255582 | HS-grad | 9 | Never-married | Machine-op-inspct | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K | 2
5 | 28 | Private | 274679 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 0 | 50 | United-States | <=50K | 2
6 | 42 | Private | 204235 | Some-college | 10 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | >50K | 2

Synthetic Data

index | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket | # duplicates
0 | 19 | ? | 124651.0 | 11th | 7 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | 2
1 | 30 | Private | 196396.0 | Some-college | 10 | Divorced | Sales | Not-in-family | White | Female | 0 | 0 | 45 | United-States | <=50K | 2
2 | 33 | Private | 206609.0 | Bachelors | 13 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | 2
3 | 45 | Private | 266860.0 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | 2
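
A table of this shape (one line per repeated record plus a "# duplicates" count) can be approximated with a pandas group-by over all columns. The sketch below is a minimal version under that assumption; the helper name `duplicate_summary` is hypothetical, the toy frame uses only three of the fifteen columns, and the report's own implementation may differ.

```python
import pandas as pd


def duplicate_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count identical rows and keep those seen more than once (hypothetical helper)."""
    counts = (
        df.groupby(list(df.columns), dropna=False)  # dropna=False keeps rows with missing cells
        .size()
        .reset_index(name="# duplicates")
    )
    repeated = counts[counts["# duplicates"] > 1]
    # Most frequent duplicates first, as in the tables above.
    return repeated.sort_values("# duplicates", ascending=False).reset_index(drop=True)


# Toy frame with a few values borrowed from the Original Data tables.
toy_rows = pd.DataFrame(
    {
        "age": [25, 25, 25, 21, 21, 33],
        "workclass": ["Private"] * 6,
        "education": ["1st-4th", "1st-4th", "1st-4th",
                      "Some-college", "Some-college", "Doctorate"],
    }
)

summary = duplicate_summary(toy_rows)
print(summary)
```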